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The  classical  technique  of  conditional  logistic  regression  is  not  efficient  for 
analyzing  matched  case-control  data  when  some  exposure  variables  have  missing 
observations.  I present  a semiparametric  Bayesian  approach  to  analyze  this 
type  of  data  by  modeling  the  distribution  of  the  exposure  variable  which  may 
have  potentially  missing  values.  It  is  assumed  that  the  exposure  distribution  is 
a member  of  the  exponential  family.  The  novel  feature  of  this  method  is  that  it 
captures  unmeasured  stratum  heterogeneity  in  the  distribution  of  the  exposure 
variable  which  is  most  often  ignored  or  is  not  explicitly  modeled  in  this  context. 

I use  a Dirichlet  process  prior  on  the  stratum  heterogeneity  parameters,  and 
parametric  prior  on  all  other  parameters  of  our  interest,  and  estimate  them  through 
a Markov  chain  Monte  Carlo  integration  scheme.  One  attractive  feature  of  this 
method  is  its  data-adaptive  nature  for  capturing  true  heterogeneity  inherent  in  the 
data  sets.  The  method  is  illustrated  by  two  real  life  examples. 

Next  I consider  a matched  case-control  study  with  multiple  disease  states.  The 
probability  of  disease  development  is  described  by  a multinomial  logistic  regression 
model.  The  goal  of  the  study  is  to  assess  the  effect  of  potential  risk  factor  on  the 
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different  categories  of  the  disease.  As  before,  I assume  that  the  distribution  of  the 
exposure  variable  belongs  to  a general  exponential  family,  and  allow  for  possible 
unmeasured  stratum  heterogeneity.  The  proposed  method  is  extended  to  cases  in 
which  measurement  error  and  missingness  occur  simultaneously  in  an  exposure 
variable. 

Finally  I consider  association  modeling  among  multiple  exposure  variables  in  a 
matched  case-control  study  when  some  of  the  exposure  variables  may  be  partially 
missing.  I consider  two  different  types  of  association  models  accompanied  by  two 
real-data  examples.  Concluding  remarks  and  some  directions  for  future  research  are 
included  in  the  end. 
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CHAPTER  1 

SELECTED  REVIEW  OF  CASE-CONTROL  STUDIES 
1 . 1 Introduction 

The  goal  of  an  epidemiologic  study  is  to  find  causes  of  a disease  and  then 
assess  the  degree  of  association  between  the  disease  and  its  potential  risk  factors. 

To  collect  data  for  an  epidemiologic  study,  one  can  use  a retrospective  study 
design,  or  a prospective  study  design,  or  a cross-sectional  study  design,  or  a 
combination  of  all  three  study  designs  depending  upon  the  need  of  the  study. 

The  data  mainly  consist  of  a disease  outcome,  and  some  exposure  variables  along 
with  some  demographic  information  of  subjects.  Case-control  study  designs  are 
retrospective  designs,  where  one  first  observes  the  disease  outcome  of  a subject  and 
then  the  subject  is  followed  back  to  get  exposure  information.  Thus,  a case-control 
study  consists  of  a group  of  diseased  subjects  (cases)  and  a group  of  nondiseased 
subjects  (controls)  along  with  their  exposure  information.  Cases  and  controls  can 
be  randomly  chosen  from  a population,  or  by  using  stratified  random  sampling 
(Scott  and  Wild,  1986).  The  present  Chapter  basically  deals  with  the  historical 
development  of  how  to  analyze  case-control  data  with  exposure  variables,  which 
may  be  binary,  categorical  or  continuous. 

Quite  often  dataset  contains  missing  observations  on  some  exposure  variables. 
The  complete-case  analysis  which  naively  excludes  subjects  with  missing  exposure 
values,  produces  inefficient  estimates  of  the  parameters  of  interest  compared  to 
the  one  which  considers  all  the  subjects  even  if  some  subjects  may  have  some 
missing  observations.  In  the  following  Chapters  I propose  a new  method  for 
analyzing  matched  case-control  data  with  partially  missing  observations  on 
exposure  variables. 
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In  a matched  case-control  study,  controls  are  matched  with  a case  (or  several 
cases)  on  the  basis  of  some  matching  factors  (called  confounding  variables)  such 
as  age,  gender,  hospital  wards  etc.  Matching  is  important  if  exposure  variable  is 
strongly  associated  with  matching  variables;  otherwise  unmatched  and  matched 
analysis  of  a matched  case-control  data  will  produce  comparable  estimates  of  the 
parameters  of  our  interest. 

Consider  a case-control  study  with  a binary  exposure  variable  X and  disease 
indicator  D.  Suppose  one  observes  n\  cases,  and  no  controls,  and  they  are  classified 
according  to  a single  binary  exposure  variable  X (X  — 1 exposed,  and  X — 0 
unexposed);  so  from  the  sample,  the  exposure  odds  ratio  of  cases  to  controls  is 

P(X  = 1\D  = 1)P(X  = 0\D  = 0)  = nnnoo 
P(X  = 1| D = 0 )P(X  = 0 \D  = 1)  n10n0i ' 

where  nn,  and  nio  be  the  number  of  exposed,  and  unexposed  cases  while  noi  and 
n0 o be  the  number  of  exposed  and  unexposed  controls  in  the  sample.  Cornfield 
Table  1-1:  Case-control  data  with  a binary  exposure  variable 


Disease 

Status 

Exposed 

Not  Exposed 

Total 

Case 

nu 

nio 

ni 

Control 

n0 1 

n00 

no 

Total 

ei 

Co 

N 

(1951)  showed  that  disease  odds  ratio  with  exposure  X — 1,  relative  to  that 
for  X — 0 is  the  same  as  the  exposure  odds  ratio  for  a diseased  person  to  a 
nondiseased  person.  That  is, 

P(D  = 1\X  = 1 )P(D  = 0|X  = 0)  P{X  = 1| D = 1 )P{X  = 0|D  = 0) 

P(D  = 0|X  - 1 )P(D  = 1\X  = 0)  “ P{X  = l\D  = 0)P{X  = 0\D  = 1) ' 

For  a rare  disease,  both  P{D  — 0|X  = 0),  and  P(D  = 0|X  = 1)  are  approximately 
equal  to  1;  so  the  disease  odds  ratio,  say,  9(=ex p(/5) , say,  where  /3  denotes  the 
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log-odds  ratio  parameter),  becomes  the  relative  risk.  That  is 

n _ — 1\X  = 1)P(D  = 0|*  = 0)  P{D  = 1|*  = 1)  (12) 

P(D  = Q\X  = 1 )P{D  = 1\X  = 0)  ~ P{D  = 1\X  = 0) ' 

Note  that 

P(D  = 1\X  = 1)  P{D  = 1|X  = 0) 

_1  ^ P(D  - 0|X  = 1)  _ P{D  = 0|X  = 0) 

or,  P(D  = 0\X  = 1)=P(D  = 0\X  = 1). 

So,  an  odds  ratio  (relative  risk  for  rare  disease)  of  1 implies  that  there  is  no 
association  (i.e. , independence)  between  the  disease  and  the  potential  risk  factor; 
whereas  an  odds  ratio  other  than  1 implies  that  exposure  is  either  synergistic  or 
antagonistic.  It  is  easy  to  show  that  for  a large  sample,  a variance  estimate  of  log# 
is  (1/nn  + 1/nio  + l/n0i  + l/n00),  provided  that  none  of  the  cell  frequencies  is 
zero.  To  see  this,  let  7r  = (7Ti(l), 7t0(1))t,  where  7Tr(l)  — P(X  — 1| D = r ) and 
7Tr(l)  G (0, 1)  for  r = 0, 1 so  0(ir)  = 7r1(l)7ro(O)/7ri(0)7ro(l).  Using  the  Delta 
method, 

log(9(ir))  = log(9(ir))  + T:log9|^,x(*-'7r)  + op(-T) 

= ^w+U(i)(iU(i))+Mi)(iU(i))}(*-ir) 

For  large  samples,  log(9)  — log(9)  has  an  asymptotic  normal  distribution,  with  mean 
0 and  variance  (1/nn  + 1/nio  + 1/noi  + 1/noo)- 

1.2  Inferences  from  2x2  Table 


For  a small  sample  size,  inference  is  based  on  the  conditional  distribution  of 
paired  binomial  data  given  the  marginal  totals  (which  is  considered  as  ancillary 
in  the  sense  that  they  do  not  contain  any  information  about  the  parameter  of 
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interest).  So 


Pr(nii|ni,no,ei,eo;0) 


/ \ 

( \ 

n i 

no 

9n  11 

n"  J 

^ ei  - nn  J 

( \ 

t \ 

y; 

nx 

; no 

\ U ) 

l ei  - n ii  J 

(1.3) 


The  above  distribution  is  called  a non  central  hypergeometric  distribution.  To 
test  Hq\  9 = 90  against  K : 9 > dQ,  one  calculates  the  upper  tail  probability  under 
the  distribution  shown  in  Equation  (1.3), 


Vu='Yh  P{u\ni,no,ei,e0-,9o)  (1.4) 

u>n\i 

which  measures  the  degree  of  support  for  H0.  Similarly  to  test  Ho  against  K : 

9 < 9q  we  should  calculate  lower  tail  probability.  The  above  test  is  known  as 
Fisher’s  exact  test.  Altham  (1969)  gave  a Bayesian  interpretation  of  (1.3). 

Table  1-2:  Series  of  2 x 2 table  for  case-control  data 


Disease 

Status 

Exposed 

Not  Exposed 

Total 

Case 

^lli 

nioi 

nu 

Control 

^Oli 

nooi 

^0  i 

Total 

Cl  i 

COi 

Ni 

Suppose  now  we  have  a series  of  2 x 2 tables,  and  the  odds  ratio  is  same  (say 
9)  across  the  series,  and  we  want  to  test  Ho  : 9 — 1.  Under  Hq , the  expectation 
and  variance  of  cell  count  for  exposed  cases  in  the  ith  table  will  be  E{nm\9  — 1)  = 
nueu/Ni:  Var(nni]9  = 1)  = (eiie0iniin0i)/A'f(Ai  - 1),  and  the  test  statistic 
(Cochran  1954,  Mantel  and  Haenszel  1959)  is 

2 {lELiKii  - E(nm\0  = !))|  - §} 

Y,LiVar(nni\8  = 1) 


(1.5) 


5 


1.3  Test  for  Homogeneity  of  Odds  Ratio 
Mantel  and  Haenszel  (1959)  proposed  a summary  odds  ratio  estimate  (relative 
risk)  across  the  tables 


' mh 


X^i=l  'R'lli'fl'OOi / Ni 
nonnioi/Ni 


(1.6) 


The  statistic  given  in  (1.5)  is  unaffected  by  zero  cell  counts,  and  consistently 
estimates  the  common  odds  ratio.  In  a similar  way,  one  can  also  test  for  homo- 
geneity of  odds  ratios  across  the  tables.  One  calculates  the  expected  frequency  and 
variance  of  exposed  cases  under  9 — 9mh,  and  then  construct  the  test  statistic  as 
Xa=i{niii  — E(niii\Omh)}2 /Var(niu\6mh)-  Under  H0 , the  above  statistic  has  an 
approximate  x2  distribution  with  I — 1 degrees  of  freedom.  The  above  formulation 
could  be  extended  to  K(>  2)  exposure  levels  (Armitage  and  Berry,  1994)  which 
was  further  extended  to  incorporate  stratification.  Mantel  and  Haenszel  did  not 
provide  any  variance  estimator  for  their  proposed  odds  ratio  estimator.  Robins, 
Breslow  and  Greenland  (1986)  and  Phillips  and  Holland  (1987)  proposed  variance 
estimator  covering  two  different  situations:  ( i ) a large  number  of  tables  with  small 
frequencies,  and  ( ii ) a small  number  of  tables  with  large  frequencies.  The  key 
observation  is  E(Ri)  = OiE(Si),  where  S;  — nninmi/Ni , S';  = noiiTRoi/A;,  where  0; 
is  the  odds  ratio  for  table  1-2.  Assuming  a common  value  of  0;,  say  9,  9mh  is  the 
solution  of  the  unbiased  estimating  equation  R — 9S  = 0,  where  R = Xw=i  and 
S = ]Tb=1  Si . Now  with  one  step  Taylor’s  expansion, 


log{6mh ) = log{9)  + 
= log{6)  + 


= log (9)  + 


R-9S 

9S 

R-9S 
9E(S) 
R-  OS 
E(R ) 


+ °p( 

+ op( 


+ °p( 


Var(R)  Var(S) 


+ 


E2(R)  E2(S ) ' 
Var(R)  Var(S) 
E2(R)  + E2(S)  ' 
Var(R)  Var(S) 
E2(R)  E2(S) 
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Under  paired  binomial  sampling,  the  variance  of  the  individual  contribution  to  this 
estimating  equation  satisfies 


N?Var(Ri  - 9 Si) 


2^'[{nllin00i  + Qn01in10i} 


+ n00i  + Q(n01i  + n10i)}]- 


Combining  the  above  two  results,  one  gets 


V ar{log{9mh)) 


Var(R  - 9S) 
E2(R) 


Eli  Var(Ri  - 9Sj ) 
E2(R) 


Hence,  Var(log(9mh ))  = Ei=i  Var(Ri  - 9Si)/R2. 

1.4  Matched  Pairs 


I have  already  mentioned  the  role  of  matching  for  controlling  confounding 
factors.  To  avoid  gross  imbalance  in  case-control  ratio  across  strata,  matching 
could  be  introduced  at  the  design  stage  of  a study.  The  simplest  situation  of 
matched  data  arises  when  one  case  is  matched  with  a control,  and  they  are 
categorized  on  the  basis  of  a binary  exposure. 

Table  1-3:  1:1  matched  case-control  data  with  a binary  exposure  variable 


Case 

Control 

Exposed  Unexposed 

Exposed 

Unexposed 

mn  mio 

moi  moo 

Suppose  one  has  mu,  m10,  m0i,  and  m0o  matched  pairs  under  different 
categories.  Let  ir  be  the  probability  of  observing  a matched  pair  with  an  exposed 
case  and  unexposed  control  given  a discordant  pair.  Then 

P(X  = 1|D  = 1)P(X  = 0|D  = 0)  0 

n ~ P(X  = 1| D = 1 )P(X  = 0| D - 0)  + P(X  = 0| D = 1 )P(X  - 1| D - 0)  “ 6 + 1' 

(1.7) 

The  conditional  distribution  of  mio  given  the  total  number  of  discordant  pairs  is 
binomial  with  parameter  (mio  + moi,  7r).  So  the  MLE  of  9 is  mw/moi  which  is 
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also  the  Mantel  Haenszel  estimator  of  the  common  odds  ratio  parameter.  Note  that 
when  9 — 1,  -k  — 1/2.  Hence  the  test  statistic  to  test  H0  : 9 = 1 is 

2 _ (|m10  - £7f0(miolm10  + m01)\  - §)2  ^ ^ 

* varHo{mw\mi0  + m0i) 


This  is  known  as  McNemar’s  (1947)  test.  One  of  the  potential  problem  of  this 
estimator,  and  this  test  is  that  it  uses  only  the  discordant  pairs  of  observation  and 
discards  the  concordant  set. 

In  case  of  1 : M matching,  the  Mantel-Haenszel  estimator  of  common  odds 
ratio  is 


~ r + l)mir_i 


9 mh 


Er=l  rm0 r 


(1.9) 


where  mu  is  the  number  of  matched  sets  where  the  case  and  s controls  are 
exposed;  and  m0s  is  the  number  of  matched  sets  where  the  case  is  unexposed  but  s 
controls  are  exposed.  The  test  statistic  for  testing  H0  : 9 = 1 is 


X 


rtr  N 
M+l' 


i}: 


T-+  (M~r+1) 

Z^r= 1 ' (M+ 1)2 


(1.10) 


where  tr  — m[r  + mor. 

The  log-odds  ratio  regression  concerns  the  effect  of  a binary  exposure  on 
the  risk  of  a disease.  Multiple  categorical  risk  factors  could  be  considered,  but 
only  by  stratifying  with  respect  to  the  other  factors.  Methods  for  considering 
simultaneous  effects  of  several  factors  were  initiated  in  1961  with  a linear  logistic 
model  (Cornfield  et  ah,  1961).  Later  Cox  (1989)  developed  the  theory  of  logistic 
regression  in  more  detail. 

1.5  Logistic  Regression  in  Case-Control  Studies 
Suppose  that  the  covariate  z is  an  arbitrary  p— vector  with  zq  as  reference 
category.  An  odds  ratio  model 


P{D  = 1| z)P{D  - 0| z0){P(D  = 0| z)P{D  = llzo)}"1  = exp{/3r(z  - z0 )}  (1.11) 


is  equivalent  to  a binary  logistic  failure  probability  model 


P(D  = l\z)  = H{{a  + (3T(z-z0))}  (1.12) 

(Prentice  and  Pike,  1979),  where  H(x)  = 1/(1  + e~x ) is  the  logistic  distribution 
function,  and  a — log{Pr(D  = 1| z0)/Pr(D  — 0|z0)}  is  the  log-odds  for  the  disease 
given  the  reference  value  Zq.  The  covariate  density  model  also  takes  the  logistic 
form,  i.e. , 

P(Z\D  = 1)  = P{z0\D  = 1)  exp[log{  =-°Qj fpT(z  ~ zo)\-  (1-13) 

Let  us  again  consider  the  unmatched  scenario  of  n\  cases  and  n0  controls, 

(n  — n o + n\)  but  with  an  arbitrary  exposure  variable  2.  The  retrospective 
likelihood  function  for  this  unmatched  design  is 

L=  H P(z:\D  = l)x  P(Zj\D  — 0).  (1.14) 

j ‘.cases  j -.controls 

Let  S'  be  a sampling  indicator,  which  indicates  whether  an  individual  is  selected 
in  the  case-control  sample  (S  = 1)  or  not  (S  = 0).  Since,  conditional  on  disease 
status,  sampling  is  independent  of  exposure,  using  Bayes’  theorem, 


P(z\D  = i)  = P(z\D  = i,S  = 1) 

P{D  = i\z,  S — l)P(z\S  — 1) 

P(D  = i\S  = 1) 

= P(D  = i\z,S  = l)P(z\S  = l)  — . (1.15) 

Tli 


As  in  Mantel(1973),  one  can  obtain  the  conditional  probability  of  a person  being  a 
case,  given  he  has  exposure  z,  and  that  he  was  sampled  for  the  case-control  study 


as 


P{D  = 1|  z,S  = 1)  = 


P{S  = 1|  D = 1 )P(D  = l|z) 
ZloP(S  = l\D  = i)P{D  = i\zy 


(1.16) 
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Here  I used  the  fact  that  sampling  is  independent  of  exposure  within  cases  and 
controls.  Inserting  (1.12)  into  (1.16),  and  using  the  relationship  Pr(D  — 1|S  = 

1 )/Pr(D  = 0|5  = 1)  = ni/n0  one  obtains 


Note  that  5 - a — log(ni/n0)  — log{P(D  = 1 )/P{D  — 0)},  which  is  the  log  odds 
ratio  of  the  disease  probabilities  in  the  sample  to  those  in  the  entire  population. 

So  far  I have  shown  that,  starting  with  a prospective  logistic  regression  model 
for  disease  incidence  during  a defined  time  interval,  one  again  arrives  at  a logistic 
regression  model  for  the  retrospective  probability  of  drawing  a case  from  case- 
control  sample.  The  log-odds  ratio  parameter  (3 , is  the  same  in  both  formulations, 
while  the  intercept  parameters  5 , and  a are  different.  While  a depends  only 
on  disease  incidence  for  the  reference  value  z0,  5 is  also  a function  of  the  mean 
incidence  in  the  population,  and  of  the  ratio  of  cases  and  controls  in  the  study 
sample. 

Substituting  (1.15)  in  L one  obtains 


P{D  = l\z,  S = 1)  = H{5  + (3t{z  - z0)}, 


(1.17) 


where 


(1.18) 


L ex  Li  x L2, 


with  the  constraint  rii/n  = f P{z\S  — 1 )P(D  = i\Z.  S — 1 )dz,  where 


Ly  = Yl  P{D  = l\Zj,S  = 1)  x P{D  = 0\Zj,S  =1).  (1.19) 


j ‘.cases 


j:  controls 


1 nd 


^=nn?ws=i) 


d=0j=l 


and 


10 


The  maximum  likelihood  estimates  (MLE)  (3  of  the  log-odds  ratio  parameter 
obtained  by  maximizing  L would  be  the  same  as  the  MLE  obtained  from  L\ 
(Prentice  and  Pyke,  1979).  It  can  be  shown  that  the  MLE  obtained  from  Li  is 
consistent  for  the  log-odds  ratio  parameter  is  asymptotically  normal. 

So  far  I discussed  logistic  regression  in  the  context  of  unmatched  studies.  Next 
focus  on  logistic  regression  in  the  matched  case-control  setup.  In  the  simplest 
setting,  the  data  consist  of  s strata  and  there  are  Mi  controls  matched  with  a case, 
for  stratum  Si,  i = 1,  • • • , s.  As  before,  I assume  a prospective  logistic  incidence 
model  for  disease 

P(D  = l\z,  Si)  = H{oi  + [3r(z  - z0)},  (1.20) 

where  oti  s are  stratum  specific  intercept  term.  Proceeding  as  in  the  unstratified 
case,  the  retrospective  logistic  model  becomes 


P(D  = l\z,  Si,  S = 1)  = H{/30(Si)  + (3 T{z  - z0)},  (1.21) 


where 


Po  (Si)  = log{ 


1 x P{D  = 0\Si) 
Mi  x P(D  = l|5j) 


P(D  = l\z0,Sj) 
P(D  = 0\zo,Siy 


(1.22) 


Assuming  that  the  1st  subject  in  each  stratum  is  a case  and  rest  of  the  subjects  are 
controls,  the  likelihood  under  this  retrospective  sampling  is 


s Mi 

L = Y[{P(zil\D  = l,Si)Y[P(zij\D  = 0,Si)} 

i=i  j= i 

s Mi 

°c  H{P(D  ^llz^S^S  = l)l[P(D  = 0\z.lj,Si,S  ^l)}.  (1.23) 

i=  1 j= 1 

The  above  likelihood  involves  (3  and  a set  of  nuisance  parameters  0o(Si)  for 
i — 1,  ■ ■ • , s which  grows  in  direct  proportion  to  the  number  of  strata.  This  gives 
rise  to  the  well  known  Neyman-Scott  phenomenon  where  the  MLE  turns  out  to  be 
inconsistent.  Conditioning  on  the  sufficient  statistics  i”1  ^ij  0 f aii  one  obtains 
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the  conditional  likelihood 


s Mi~\~  1 

l°  = n n p(Dn\zij,Si,s = i,  y]  = i) 

*=i  j=  i j=i 

JT.  exp(/3Tzu) 


(1.24) 


iiE^exp^)’ 
assuming  that  in  each  stratum  j — 1 stands  for  case.  The  above  method  is  known 
as  conditional  logistic  regression  (CLR). 

The  advantage  of  the  above  likelihood  is  (a)  it  involves  only  the  relative  risk 
parameters,  and  (b)  one  obtains  the  same  form  of  likelihood  as  either  (i)  the  data 
arising  from  a prospective  study  of  Ei=i  Mt  + s individuals,  the  conditioning 
event  being  the  observed  number  s of  cases  arising  in  the  sample;  or  (ii)  case- 
control  study  involving  s cases  and  Ei=i  Mi  controls,  the  conditioning  event  being 
Ei=i  Mi  + s exposure  histories. 

Note  that  if  x is  some  matching  variable,  then  its  contribution  to  the  condi- 
tional likelihood  is  zero;  so  one  can  not  estimate  the  relative  risk  corresponding  to 
the  matching  variable.  One  can  consider  an  interaction  term  of  a matching  variable 
with  other  covariates,  and  investigate  the  change  of  the  relative  risk  parameter 
across  strata  (Greenland  and  Robins,  1985). 

Prentice  and  Breslow  (1978)  gave  a conceptual  foundation  for  case-control 
study  deriving  conditional  likelihood(CL)  from  failure  time  consideration.  One 
of  the  main  drawback  of  CL  is  that,  if  there  is  some  missing  information  on  an 
exposure  variable  of  a subject,  the  available  partial  information  on  that  subject  is 
ignored  from  the  CLR.  For  a 1 : 1 matched  case-control  study,  missing  observa- 
tion on  an  exposure  of  a subject  leads  to  deletion  of  the  entire  stratum  from  the 
CLR.  In  that  situation,  one  only  uses  the  strata  which  have  complete  information 
for  all  the  subjects,  and  lose  information  on  strata  containing  any  missingness. 
Substantial  amount  of  work  is  available  on  the  issue  of  missing  exposure.  There 
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are  two  basic  approaches  to  handle  this  problem.  One  is  to  model  the  distribution 
of  exposure  variable  and  the  other  is  to  model  the  missingness  process.  Lawless 
et  al.  (1999)  proposed  a semiparametric  method  to  deal  with  the  exposure  dis- 
tribution. Paik  and  Sacco  (2000)  modeled  the  distribution  of  exposure  variables 
parametrically,  and  then  used  pseudo  conditional  likelihood  method  to  estimate 
the  parameters.  Lin  and  Ying  (1993)  considered  missing  covariate  values  in  Cox’s 
model.  Other  studies  address  the  issue  of  misclassification  of  exposure  variables 
and  mismeasured  covariates  (Carroll  et  al.,  1993;  Carroll  et  al.,  1995;  Dahm  et  ah, 
1995),  as  well  as  sample  size  determination. 

1.6  Bayesian  Analysis  of  Case-Control  Studies 
So  far  we  discussed  case-control  studies  in  frequentist  framework.  Now  I 
focus  on  Bayesian  analysis  of  case-control  problems.  Zelen  and  Parker  (1986), 
Nurminen  & Mutanen  (1987),  Marshall  (1988),  and  Ashby  et  al.  (1993)  addressed 
the  Bayesian  method  for  case-control  study  with  only  a single  binary  exposure 
variable.  All  of  them  used  the  following  model  for  a binary  exposure  variable. 

Let  (j)  and  7 be  the  probabilities  of  exposure  in  control  and  case  populations 
respectively.  The  retrospective  likelihood  is 

7)  oc  <^noi(l  - 0)noo7nil(l  -7)"10,  (L25) 

where  noi  and  noo  are  the  number  of  exposed  and  unexposed  observations  in  a 
control  population;  whereas  nn  and  ni0  denote  the  same  for  a case  population. 

Independent  conjugate  prior  distributions  for  (f>  and  7 are  assumed  to  be 
Beta(ui,U2)  and  Beta{v\,V2)  respectively.  After  reparametrization,  one  obtains  the 
posterior  distribution  of  the  log  odds  ratio  parameter,  f3  = £05(7(1  — $)/(/>(  1 — 7)} 
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as 


p{P\nn,n1o,noi,n00)  oc  exp{(nn  + vi)f3} 


L 


1 A)ni 


X 


The  posterior  density  of  (3  does  not  exist  in  closed  form,  but  may  be  evaluated  by 
numerical  integration. 

Since  interest  often  lies  in  the  hypothesis  (3  = 0,  Zelen  and  Parker  (1986) 
recommend  calculating  the  ratio  of  the  two  posterior  probabilities  p(/3)/p(0 ) at 
selected  deviates  f3.  When  (3  is  set  at  the  posterior  mode,  a large  value  of  this 
ratio  will  indicate  concentration  of  the  posterior  away  from  0 and  one  would  infer 
disease-exposure  association.  However  the  critical  value  suggested  for  this  ratio  is 
completely  arbitrary.  They  also  provide  a normal  approximation  to  the  posterior 
distribution  of  (3  to  avoid  numerical  computation.  Zelen  and  Parker  (1986)  also 
discuss  the  problem  of  choosing  a prior  distribution  based  on  some  prior  data  on 
exposure  information.  They  also  raise  the  possibility  of  performing  a case-control 
analysis  without  the  presence  of  a control  sample,  if  adequate  exposure  information 
is  available  from  prior  studies. 

Nurminen  and  Mutanen  (1987)  considered  a more  general  parameterization 
in  terms  of  the  odds  ratio  R — exp(/3)  which  covers  risk  ratio  and  risk  differences. 
They  provided  a complicated  exact  formula  for  the  cumulative  density  function 
of  this  general  comparative  parameter.  The  formula  can  be  related  to  Fisher’s 
exact  test  for  comparing  two  proportions  in  sampling  theory.  The  Bayesian  point 
estimates  are  considered  as  posterior  median,  and  mode,  whereas  inference  is  based 
on  highest  posterior  density  interval  for  the  comparative  parameter  of  interest. 

Marshall  (1988)  provided  a closed-form  expression  for  the  moments  of  the 
posterior  distribution  of  the  odds  ratio.  He  mentioned  that  an  approximation  to 
the  exact  posterior  density  of  the  odds  ratio  parameter  can  be  obtained  by  power 
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series  expansion  of  the  hypergeometric  functions  involved  in  the  expression  for  the 
density  but  acknowledges  the  problem  of  slow  convergence  in  adopting  this  method. 
Instead  Marshall  used  Lindley’s  (1964)  result  for  the  approximate  normality  of 
log(i?)  which  works  very  well  over  a wide  range  of  situations.  In  the  absence  of 
exposure  information,  Marshall  recommended  using  independent  priors  on  the 
parameters.  He  suggested  that  a perception  about  the  value  of  the  odds  ratio 
should  guide  the  choice  of  prior  parameters  rather  than  attempting  to  exploit  the 
exposure  proportions  as  suggested  in  Zelen-Parker.  Inference  again  is  based  on 
posterior  credible  intervals. 

Muller  and  Roeder  (1997)  presented  a semiparametric  Bayesian  approach  to 
case-control  studies  having  continuous  exposures  with  measurement  error.  The 
outline  of  their  approach  is  as  follows: 

Let  D be  a binary  response  observed  together  with  covariates  X,  and  Z, 
where  Z may  be  vector- valued.  The  retrospective  case-control  sample  is  chosen 
by  selecting  n i cases  with  D = 1 and  n0  controls  with  D — 0.  The  error  in 
variables  occurs  when  for  a subset  of  the  data-  reduced  data  or  (1Z)  from  now 
on-  a surrogate  W is  measured  instead  of  the  true  covariate  X.  For  a typical 
validation  study,  for  the  remaining  subjects-  complete  data  (C)  from  now  on-one 
has  observations  on  both  the  gold  standard  X and  the  surrogate  W . 

The  approach  proposed  by  Miiller  and  Roeder  (1997)  is  based  on  a nonpara- 
metric  model  for  the  retrospective  likelihood  of  the  covariates,  and  exposure.  For 
reduced  data,  this  nonparametric  distribution  models  the  joint  distribution  of 
( Z,W ) and  the  missing  covariate  Xr.  For  complete  data,  the  same  distribution 
models  the  joint  distribution  of  the  observed  covariates  ( Xc , Z , W).  The  nonpara- 
metric distribution  is  a class  of  flexible  mixture  distributions,  obtained  by  using 
mixture  of  normal  models  with  a Dirichlet  Process  prior  on  the  mixing  measure 
(Escobar  and  West,  1995).  For  the  prospective  disease  model,  relating  disease  to 
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exposure  is  taken  to  be  a logistic  distribution  characterized  by  a vector  of  param- 
eters f3.  Also,  let  6 denote  the  parameters  describing  the  marginal  distribution  of 
the  exposure  and  covariates.  Under  certain  conditional  independence  type  assump- 
tions and  nondifferential  measurement  error,  the  retrospective  likelihood  is  shown 
to  be  compatible  with  the  prospective  model.  With  a prior  p((3,  0),  one  obtains  the 
joint  posterior  as, 


By  augmenting  the  parameter  with  the  latent  vector,  one  can  write  the  joint 
posterior  of  parameters  and  missing  exposure  XR  as 


The  above  likelihood  can  be  factorized  by  using  Bayes  theorem  under  the  ad- 
ditional assumption  of  nondifferential  measurement  error:  p(D \X,  Z,  W,  (3)  — 


Then  the  following  mixture  model  with  a multivariate  normal  kernel  (f>$.  is  as- 
sumed: 


n 


p(D\X,Z,(3),  as 


n 


(1.29) 


p{Xi,Zi,Wi\9i)  = ^{X^ZuWi) 


(1.30) 


di  ~ G 


G ~ DP(aGo). 
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Where  9i  =(/ij,  £*),  the  parameters  of  the  trivariate  normal  kernel,  and  DP{aG o) 
denotes  a Dirichlet  process  with  a base  measure  G0,  and  concentration  parameter 
a.  The  hierarchy  is  completed  by  assuming  a normal/Wishart  hyper-prior  on 
the  parameters  of  the  base  measure  Go  = iV(/zo,  Eo),  and  a Gamma  prior  on 
the  concentration  parameter  a.  A noninformative  prior  is  assumed  on  f3.  A 
Markov  chain  Monte  Carlo  numerical  integration  scheme  is  designed  for  estimating 
the  parameters.  The  computation  scheme  becomes  either  very  expensive  or 
inaccurate  when  the  dimension  of  the  ( X , Z,  W)-space  increases  as  one  must 
then  evaluate  p(Dj\(3,  0)  by  numerical  integration  over  the  (X,  Z,  W)-space.  This 
paper  breaks  many  new  grounds  in  Bayesian  treatment  of  case-control  studies 
including  consideration  of  continuous  covariates;  measurement  error;  and  flexible 
nonparametric  modeling  for  exposure  distribution  leading  to  robust,  interpretable 
results  quantifying  the  effect  of  exposure. 

Muller  and  Roeder  point  out  that  categorical  covariates  require  a different 
treatment,  assuming  a multivariate  normal  kernel  for  the  Dirichlet  Process  im- 
plicitly assumes  continuous  exposures.  Seaman  and  Richardson  (2001)  extend  the 
binary  exposure  model  of  Zelen-Parker  to  any  number  of  categorical  exposures,  by 
simply  replacing  the  binomial  likelihoods  in  (1.25)  by  a multinomial  likelihood,  and 
then  adopting  a MCMC  strategy  to  estimate  the  log  odds-ratio  for  the  disease  in 
each  category  with  respect  to  a baseline  category.  The  set  of  multinomial  prob- 
abilities corresponding  to  exposure  categories  in  case  and  control  populations  is 
assumed  to  have  a discrete  Dirichlet  distribution  prior.  The  prior  on  the  log-odds 
ratios  could  be  chosen  to  be  anything. 

Seaman  and  Richardson  also  adapted  the  Miiller-Roeder  approach  to  cate- 
gorical covariates  by  simply  modeling  the  multinomial  probabilities  corresponding 
to  exposure  categories  in  the  entire  source  population  as  opposed  to  looking  at 
case  and  control  populations  separately  as  in  the  generalized  Zelen-Parker  binary 
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model  described  above.  A noninformative  Dirichlet  (0,0, ..,0)  prior  is  assumed  on 
the  exposure  probabilities.  Seaman  and  Richardson  established  the  equivalence  of 
the  generalized  binary  and  adapted  the  Miiller-Roeder  approach  in  some  limiting 
cases  with  uninformative  Dirichlet  (0,0, ..,0)  prior  on  0. 

Continuous  covariates  were  treated  in  Seaman  and  Richardson  (2001)  frame- 
work by  discretizing  them  into  groups.  Little  information  is  lost  if  discretization 
is  sufficiently  fine.  This  opens  up  the  possibility  of  treating  continuous  and  cat- 
egorical covariates  simultaneously  which  is  not  obvious  in  the  Miiller-Roeder 
approach. 

Muller  et  al.  (1999)  proposed  a hierarchical  Bayesian  approach  for  combining 
case-control  study  and  a prospective  cohort  study,  and  to  estimate  the  absolute  risk 
of  the  disease.  They  modeled  retrospective  distribution  of  the  exposure  variable 
given  the  disease  status,  and  accounted  for  parameter  heterogeneity  across  studies 
by  using  a hierarchical  Bayesian  approach.  A mixture  model  is  assumed  for  the 
exposure  variable.  Parameters  were  estimated  via  Markov  chain  Monte  Carlo 
integration  scheme. 

Diggle,  Morris  and  Wakefield  (2000)  presented  the  first  Bayesian  analysis 
for  cases  individually  matched  to  the  controls  (resulting  nuisance  parameters  are 
introduced  to  represent  the  separate  effect  of  matching  in  each  matched  set) . They 
considered  matched  data  when  exposure  of  primary  interest  is  defined  by  the 
spatial  location  of  an  individual  relative  to  a point  or  line  source  of  pollution. 

The  basic  model  of  Diggle  et  al.  (2000)  for  an  unmatched  set-up  is  as  follows. 
Let  locations  of  individuals  in  the  population  at  risk  follow  an  inhomogeneous 
Poisson  process  with  intensity  function  g(x ),  and  the  probability  that  an  individual 
at  location  x becomes  a case  is  p*(x).  Then  the  odds  of  disease  among  sampled 
individuals  are 


r(x)  = {a/b)p*(x)/(l  -p*{x)) 
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where  a and  b are  the  sampled  proportions  of  cases  and  controls.  Further,  r(x)  is 
modeled  as  a nonlinear  logistic  regression  model,  namely, 

r(x)  — ph(x,  9) 

where  h(x,  9)  involves  interpretable  parameters  of  interest  9.  The  extension  of  this 
model  to  J individually  matched  case-control  pairs  in  a study  region  is  made  in 
the  following  way.  Suppose,  the  J locations  for  cases  and  controls  are  denoted  by 
Xoj,  and  X\j  respectively,  j=l,..,J.  The  probability  of  disease  for  an  individual  at 

location  x in  stratum  j is  modeled  as: 

„ _ r?(Q  _ PiH*<e) 

Po\x)  — i +Tj(x)  ~ i+Pjh(x,o)’ 

where  pj  are  varying  baseline  odds  between  matched  pairs  and  are  considered  as 
nuisance  parameters.  Conditioning  on  the  unordered  set  of  exposures  within  each 
matched  set,  one  has  the  conditional  probability  of  disease  of  an  individual  at 
distance  x in  stratum  j 


Pc(xJ0 ) - h(x3-oi)+h(^i,e)  • 

In  case  of  1 : M matching  in  each  stratum,  and  in  the  presence  of  q additional 
covariates  Zk{xji ) measured  for  the  i-th  individual  in  j-th  stratum  at  location  Xji, 

* = !,•••  , M\j  = !,■•■  , J;  k — 1,  • • • ,q,  the  full  fledged  conditional  probability  can 


be  modeled  in  the  form 


/ \ _ h(:rj0.fl)<sxp(ELi 

Pc{  j 01  ES Kxjifi)  exp(ELi  ’ 

with  the  conditional  log  likelihood  being  of  the  form 

j 

L(6,<t>)  = ^2\ogpc(xj0). 
i 

Diggle  et  al.  (2000)  considered  this  likelihood  as  well  as  Bayesian  approach  for 
estimation  of  parameters  and  inference  in  this  model  with  specified  form  of  f(x,9). 
Often  in  studying  environmental  pollution,  there  is  a putative  point  source  at 
location  x*,  and  interest  is  on  how  the  risk  surface  changes  in  relation  to  x* . Let 
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d = \\x  — a;*  ||  be  the  distance  of  location  x from  the  source.  In  such  instances  they 
propose  h of  the  form 

h(x)  = 1 + a exp(— (d/fi)2),  for  fixed  x* . 

This  model  has  a natural  interpretation,  with  a being  the  proportional  increase 
in  disease  odds  at  the  source,  while  (5  measures  the  rate  of  decay  with  increasing 
distance  from  the  source  in  units  of  distance.  The  standard  likelihood  inference 
for  such  models  based  on  the  likelihood  ratio  statistic  may  not  be  reliable,  as 
the  likelihood  involved  is  highly  nonregular.  The  Bayesian  paradigm  provides  an 
alternative  when  likelihood  surface  is  highly  irregular.  In  their  paper  independent 
normal  priors  are  assumed  on  the  covariate  related  regression  parameters  4>,  and 
independent  uniform  priors  are  chosen  for  a and  (5.  Estimation  is  carried  out  by 
adopting  an  MCMC  technique,  and  inference  is  conducted  by  computing  the  Bayes 
factor  for  testing  the  null  hypothesis  h(x)  — 1 which  amounts  to  saying  that  the 
risk  of  the  disease  does  not  depend  on  the  distance  from  the  putative  source. 

Ghosh  and  Chen  (2002)  developed  general  Bayesian  inferential  techniques  for 
matched  case-control  problems  in  the  presence  of  one  or  more  binary  exposure 
variables.  Their  model  was  more  general  than  that  of  Zelen  and  Parker  (1986). 

Also  their  analysis,  unlike  that  of  Diggle,  Morris,  and  Wakefield  (2000),  was  based 
on  an  unconditional  likelihood  rather  than  a conditional  likelihood  after  elimination 
of  nuisance  parameters.  Any  methodology  based  on  the  conditional  likelihood  is 
applicable  only  for  the  logit  link.  The  general  Bayesian  methodology  based  on 
the  full  likelihood  as  proposed  by  Ghosh  and  Chen  worked  beyond  the  logit  link. 
Their  procedure  included  not  only  the  probit  and  the  complementary  log  links 
but  also  some  new  symmetric  as  well  as  skewed  links.  The  propriety  of  posteriors 
was  proved  under  a very  general  class  of  priors  that  need  not  always  be  proper. 
Also,  some  robust  priors  (such  as  multivariate  t-priors)  were  used.  The  Bayesian 
procedure  was  implemented  via  Markov  chain  Monte  Carlo. 
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To  illustrate  their  general  methodology,  Ghosh  and  Chen  considered  the 
Breslow  and  Day’s  (1980)  the  Los  Angeles  (LA)  cancer  data,  and  studied  the  effect 
of  gall-bladder  disease  on  the  risk  of  endometrial  cancer  taken  as  a single  exposure 
variable;  and  both  gall-bladder  disease  and  hypertension  as  multiple  exposure 
variables.  For  situations  with  one  case,  and  one  control,  it  was  shown  that  the 
Bayesian  procedure  could  yield  conclusions  different  from  those  of  conditional 
likelihood  regarding  association  between  the  exposures  and  the  disease.  This  is 
because,  while  the  conditional  likelihood  ignores  all  concordant  pairs,  the  Bayesian 
procedure,  based  on  the  full  likelihood  also  uses  these  pairs. 

Seaman  and  Richardson  (2004)  showed  the  equivalence  of  prospective  and 
retrospective  analysis  of  case-control  studies  in  a Bayesian  analysis.  They  showed 
that  the  posterior  distribution  of  the  log-odds  ratio  obtained  from  a prospective 
and  a retrospective  likelihood  are  equivalent  upto  a proportionality  constant. 

1.7  Topics  of  this  Dissertation 

Bayesian  analysis  of  case-control  study  got  less  attention  than  frequentist 
approach.  Specifically,  there  are  only  a few  Bayesian  works  related  to  problems  of 
matched  case-control  study.  This  dissertation  is  designed  to  deal  with  missing  data 
problem  in  a matched  case-control  study.  Each  matched  set  consists  of  a case  and 
several  controls;  and  for  each  subject  in  the  study,  one  observes  a disease  variable 
D , an  exposure  variable(s)  X,  and  a covariate  Z.  X is  assumed  to  be  missing  at 
random  (MAR) . Here  I develop  a method  to  handle  MAR  data  which  exploits 
all  the  available  information  on  a subject  even  if  the  information  on  some  of  the 
exposure  variables  are  missing  for  the  subject.  Logistic  distribution  is  assumed 
for  the  prospective  model  of  the  disease  status  which  may  vary  across  strata.  A 
full-likelihood-based  approach  is  used  by  assuming  a parametric  distribution  of  the 
missing  exposure  variable  among  the  control  population  and  finally  the  odds  ratio 
parameters  are  estimated  in  a fully  Bayesian  framework.  The  major  departure  of 
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this  research  is  to  allow  stratum  heterogeneity  in  the  exposure  distribution  and 
how  one  handles  this  heterogeneity  parameters  in  a fully  Bayesian  framework 
by  using  a Dirichlet  process  prior  with  a normal  base  measure.  In  case  of  fully 
observed  data,  this  assumption  of  heterogeneity  may  not  be  crucial,  but  in  the 
presence  of  missing  data  and  unmeasured  stratum  effect,  it  makes  a lot  of  difference 
in  the  inference.  I extend  the  above  method  to  some  orthodox  data  situations: 

(i)  when  the  case  has  multiple  categories,  (ii)  when  the  exposure  variables  have 
measurement  error  and  contain  partial  missing  observations,  and  (Hi)  when  case- 
control  data  contains  multiple  exposure  variables  which  are  associated  among 
themselves.  The  outline  of  the  rest  of  the  dissertation  is  as  follows.  In  Chapter  2, 

I deal  with  the  problem  of  a missing  exposure  variable  in  a matched  case-control 
study.  A semiparametric  Bayesian  method  is  proposed  to  handle  the  missingness, 
and  to  capture  unexplained  stratum  heterogeneity  on  the  distribution  of  the 
exposure  variable.  In  Chapter  3,  I extend  the  proposed  Bayesian  semiparametric 
method  to  a matched  case-control  study  with  missing  exposure  variable  when  the 
disease  has  multiple  categories.  A unified  semiparametric  Bayesian  approach  is 
presented  in  Chapter  4,  to  handle  simultaneous  occurence  of  measurement  error 
and  missingness  in  an  exposure  variables.  In  Chapter  5,  I consider  the  problem 
of  modeling  association  among  multiple  exposure  variables  with  partially  missing 
values.  Chapter  6 offers  concluding  remarks  and  directions  for  future  research. 


CHAPTER  2 

SEMIPARAMETRIC  BAYESIAN  ANALYSIS  OF  MATCHED 
CASE-CONTROL  STUDIES  WITH  MISSING  EXPOSURE 

2.1  Introduction 

In  this  Chapter,  we  consider  matched  case-control  studies  when  some  exposure 
variables  are  not  observed  for  all  study  subjects.  For  example,  in  the  Los  Angeles 
Study  on  endometrial  cancer  in  a retirement  community  (Breslow  and  Day,  1980),  a 
binary  variable  denoting  obesity  was  missing  in  approximately  16%  of  respondents. 
So  far,  there  are  two  main  approaches  to  handle  the  missingness  in  matched  case- 
control  problems.  Lipsitz,  Parzen  and  Ewell  (1998),  and  Rathouz,  Satten  and 
Carroll  (2002)  modeled  the  missingness  process  directly.  More  germane  to  my 
approach  is  the  likelihood  method  that  involves  modeling  the  missing  covariate 
distribution  among  the  controls,  see  Satten  and  Kupper  (1993a,  1993b),  who 
initiated  this  approach.  Wang,  Wang  and  Carroll  (1997),  and  Paik  and  Sacco 
(2000)  attacked  the  missing  data  problem  directly  by  positing  that  the  exposure 
distribution  among  the  controls  belongs  to  a canonical  exponential  family:  while 
the  latters’  model  is  fully  parametric,  their  estimation  method  is  a pseudolikelihood 
methodology  rather  than  full  likelihood,  and  in  the  absence  of  missing  data 
reduces  to  the  usual  conditional  logistic  regression  (CLR).  Satten  and  Carroll 
(2000)  generalized  the  Satten-Kupper  and  Paik-Sacco  methodology  to  allow  a full 
parametric  likelihood  analysis,  allowing  essentially  any  exposure  distribution  among 
the  controls,  and  providing  for  full  likelihood  analysis:  their  approach  does  not 
reduce  to  the  usual  CLR  with  no  missing  data. 

I develop  a Bayesian  approach  to  case-control  studies  with  a disease  indicator 
D,  a completely  observed  covariate  vector  Z,  and  an  exposure  or  risk  factor  X 
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with  possibly  missing  values.  I will  also  include  variables  S = (Sq,Su)  that  define 
the  strata  used  for  matching.  These  are  generally  a vector  of  covariates,  so  that 
Sa  indicates  covariates  that  are  explicit  factors  making  up  the  strata  and  Su  are 
factors  that  are  not  explicitly  used  to  form  strata  but  nonetheless  are  associated 
with  the  strata.  The  variable  X that  is  partially  missing  can  be  discrete  or 
continuous. 

An  example  will  help  to  clarify  the  strata  issue.  There  has  been  a recent  resur- 
gence of  interest  in  genetic  case-control  studies  to  explore  a variety  of  gene-disease, 
and  gene-environment  interactions.  The  prime  objective  here  is  to  examine  the 
association  between  a candidate  gene  and  the  occurrence  of  a disease.  In  such 
problems,  it  is  very  common  to  stratify  a population  which  comprises  subpopula- 
tions with  different  allele  frequencies  for  the  gene  under  consideration.  If  different 
subpopulations  have  different  risks  for  the  disease,  then  the  failure  to  take  these 
stratum  effects  into  account  will  lead  to  misleading  estimates  of  the  association 
between  the  disease  and  the  candidate  gene. 

A second  example,  specifically  considered  in  this  chapter,  is  due  to  Kim, 

Cohen  and  Carroll  (2002).  They  described  a 1:1  matched  case-control  study 
that  investigated  the  association  of  various  management  practices  with  equine 
colic.  Participating  veterinarians  were  asked  to  provide  data  monthly  for  one 
horse  treated  for  colic  and  one  horse  that  received  emergency  treatment  for  any 
condition  other  than  colic,  between  March  1,  1997,  and  February  28,  1998.  A case 
of  colic  was  defined  as  the  first  horse  treated  during  a given  month  for  signs  of 
intra-abdominal  pain.  A control  horse  was  defined  as  the  next  horse  that  received 
emergency  treatment  for  any  condition  other  than  colic.  Two  individual-level 
variables  of  interest  were  the  age  of  the  horse  X and  an  indicator  Z of  whether  the 
horse  had  undergone  a recent  change  in  diet. 
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In  this  example,  it  is  difficult  to  model  parametrically  the  “next  sick  horse  in 
the  clinic”  matching  method  by  explicit  variables  S0.  One  can  certainly  try,  e.g., 
by  accounting  for  month,  veterinarian,  region  of  Texas,  urban/rural,  etc.,  but  the 
dimensionality  is  likely  to  be  high,  and  fairly  clearly  there  is  more  to  the  matching 
than  can  be  measured  explicitly.  However,  it  is  entirely  possible  that  some  or  all 
of  these  factors  affect  the  distribution  of  X among  the  controls,  and  a likelihood 
analysis  would  have  to  account  for  them.  The  dimensionality  of  the  effects  quickly 
gets  out  of  hand  in  my  example  and  it  is  likely  that  in  such  cases  the  possible  effect 
of  stratification  on  the  distribution  of  X will  be  ignored. 

The  likelihood  approaches  referenced  above  start  with  an  explicit  parametric 
distribution  for  X among  the  controls,  D — 0,  and  given  Z as  well  as  possibly 
the  explicit  parts  SQ  of  the  strata.  Wang  et  al.  (1997)  and  Paik  and  Sacco  (2000) 
do  not  include  the  strata  in  the  model  for  X , while  Satten  and  Carroll  (2000) 
include  the  observed  components  S0  as  a linear  term.  In  the  equine  example,  I 
of  course  believe  that  it  is  effectively  impossible  to  measure  all  the  stratification 
variables,  and  then  to  model  their  effect  on  X,  which  is  likely  to  be  in  the  form  of 
an  unexplained  heterogeneity. 

My  approach  is  to  generalize  the  previous  methods  to  allow  for  the  possibility 
of  unmeasured  stratification  effects,  i.e.,  unexplained  heterogeneity  in  the  distribu- 
tion of  X given  Z among  the  controls.  Alternatively,  my  approach  can  be  looked  at 
as  a way  of  allowing  for  high  dimensional  effects  of  stratification.  The  stratification 
effects  is  modeled  via  a Dirichlet  process  prior  with  a normal  base  measure  for 
the  distribution  of  the  stratum  effects,  and  then  employ  the  Bayesian  machinery. 

By  this  route,  one  achieves  a measure  of  model  robustness,  and  acknowledge  that 
the  distribution  of  X among  the  controls  can  be  affected  by  stratification,  and 
especially  by  unmeasured  factors. 
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The  outline  of  this  Chapter  is  as  follows.  Section  2.2  introduces  the  model, 
and  notation,  while  Section  2.3  derives  the  appropriate  likelihoods  and  the  MCMC 
method  for  computing  Bayesian  inference.  One  notes  in  passing  that  a limiting 
case  of  my  approach  is  a full  parametric  Bayesian  analysis  of  matched  case- 
control  studies  with  missing  data.  Section  2.4  contains  computational  details  of 
the  proposed  method.  In  Section  2.5,  I provide  data  analysis  for  two  examples 
covering  two  special  cases  of  the  general  exponential  family:  (i)  a continuous 
exposure  in  the  equine  epidemiology  problem  and  (ii)  a binary  exposure  in  an 
endometrial  cancer  study.  Section  2.6  contains  a small  simulation  study  to  assess 
the  performance  of  our  proposed  methods.  I show  that  at  least  for  this  simulation, 
the  proposed  methods  lose  nothing  in  the  way  of  efficiency  when  there  are  no 
unmeasured  stratification  effects  on  the  missing  covariate  X , but  gain  quite  a bit  of 
efficiency  when  stratification  does  affect  the  missing  covariate.  Section  2.7  contains 
concluding  remarks. 

In  concluding  this  section,  I highlight  some  of  the  newer  aspects  of  this  work. 
To  my  knowledge,  this  is  the  first  attempt  which  provides  a Bayesian  approach, 
both  parametric,  and  semiparametric,  towards  the  analysis  of  matched  case-control 
studies  with  missing  exposure.  Our  method  is  applicable  to  both  discrete,  and 
continuous  data.  Second,  the  proposed  method  captures  explicitly  the  unmeasured 
stratum  effects  which  may  not  be  possible  just  by  introducing  additional  covariates. 

2.2  Model  and  Notation 

For  subject  j in  stratum  i,  i = 1,  • • • ,n,  j — 1,  ■ • • , M + 1,  let  represent  the 
disease  status,  namely,  = 1 for  a case  = 0 for  a control.  I consider  a single 
exposure  variable  which  may  be  partially  missing  and  a completely  observed 
p-dimensional  vector  of  covariates  Zxj.  I assume  that  the  prospective  conditional 
logistic  distribution  for  the  disease  status  is 


Pr (A;  - l\Xiv  ZtJ,  Si)  = H{P0(Si)  + fiz^  + p2Xij}, 


(2.1) 
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where  H(u ) = {1  + exp(— u)}_1.  Here  Po(Si)  is  a term  representing  the  stratum 
effect  on  the  probability  of  belonging  to  a particular  disease  state,  and  (31  and  f32 
are  coefficients  corresponding  to  the  covariates  and  exposure  in  the  above  logistic 
regression  model.  The  usual  method  is  to  work  with  the  conditional  likelihood  of 
the  Dij  with  Ylrpi  i — 1,  ■ ■ ■ , n as  the  conditioning  variables.  In  this  way,  the 
nuisance  parameters  /30 (Si)  are  eliminated. 

In  the  presence  of  a missing  Ay  for  even  one  subject,  a naive  application  of  the 
above  method  leads  to  removal  of  the  entire  i-th  stratum  even  though  observations 
on  (D,  Z,  X ) for  other  subjects  may  still  be  available.  This  naturally  entails  a loss 
of  information.  To  overcome  this  problem  I adopt  the  following  approach. 

Let  A ij  represent  an  indicator  variable  which  assumes  the  value  0 if  the 
exposure  value  is  missing  for  subject  j — 1, ...,  M + 1 in  stratum  i — 1, ...,  n and  is  1 
otherwise.  The  basic  structure  of  the  model  is 


p{^ij i Xij,  Xij | Zij , Si')  j)(Xij\D[j:  Ajj,  Z ij . S{)  x pf  A^jL^,^,  Z^j,  »Sj)  x p ( !P%j \ Z ij , 5, j . 

(2.2) 

Following  Satten  and  Carroll  (2000),  I will  assume  that  the  X’s  are  missing  at 
random  in  the  sense  of  Little  and  Rubin  (1987),  i.e. , p(Xij\Dij,  Ay,  Zy,  5»)  = 
p(Xij\Dij,  Zij , Si)  and  that  p(AyjZ)y,  Zy,  Si)  does  not  depend  on  any  parameters 
of  interest.  Next  I model  the  distributions  of  the  exposure  variable  X^  among  the 
controls  as 


p(Xij\Dij  0,  Z^,  Si)  exp  [£ij{0ijXij  6(^y)}  T c(Ay,  ^y)], 


(2.3) 


where  (9y  is  the  natural  parameter,  and  will  be  modeled  as 

@ij  To  i T T Z ij , 


(2.4) 


where  70 i,  the  varying  intercept,  captures  the  stratum  effect  on  the  natural 
parameter  6tj,  and  consequently  the  stratum  effect  on  the  exposure  distribution. 
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Paik  and  Sacco  (2000)  adopt  a simpler  parametric  model  with  7 0i  = 70-  If  one 
writes  the  stratification  variables  as  S)  = (5oi,S'uj)  with  Soi  observed  and  Sui 
unobserved,  Satten  and  Carroll  (2000)  set  joi  in  (2-4)  to  be  a parametric  linear 
function  of  Soi • As  argued  in  the  introduction,  it  may  not  be  possible  to  model  the 
70 i explicitly  in  this  way  as  a function  of  the  stratification  effects,  and  my  methods 
differ  from  others  in  the  fact  that  I address  this  issue. 

In  the  following  section,  I derive  the  joint  distribution  for  ( ) conditional 


In  this  section,  I derive  the  likelihood  function,  state  our  prior  distribution, 
and  derive  the  posterior.  The  key  aspect  of  the  modeling  is  in  how  one  handles  the 
strata  effects  in  the  model  for  the  missing  covariate  among  the  controls. 

In  what  follows,  I will  do  calculations  concerning  X as  if  it  were  a continuous 
variable,  and  hence  use  integration:  if  X is  discrete  the  integrals  are  replaced  by 
sums.  The  key  result  is  establishing  the  odds  when  X is  not  considered. 

Let  us  denote  p(X,  Z , S)  as  the  prospective  odds  of  the  disease  for  the 
exposure  X,  the  completely  observed  covariate  Z,  and  the  stratum  S.  Now 
following  Satten  and  Carroll  (2000),  I have  the  following  results. 

Lemma  1. 


on  the  sums  Aj,  and  using  (2.1)-(2.3). 


2.3  Likelihood,  Prior  and  Posterior 


pi  ( Dij  1 1 Z ij , Si ) 

pr(Aj  = 0|  Zij,Si) 


1 


p{Xij,  Zij,  Si)p(Xi:i\Si,  Z ij , Dij  = O)dp(Xij)  (2.5) 


Lemma  2. 


P ( Z-ij  | Dij  1,  Z ij  y iSj) 


P ( Zij  | Si , Z^ , Dij  0 ) p ( Xj-j , Z^ , Si) 


Here  p(-)  is  a <r-finite  dominating  measure.  The  above  two  Lemmas  hold  in  full 
generality  and  do  not  need  logistic  assumption  for  the  prospective  disease  model. 
Notice  that  from  the  prospective  odds  model  and  the  exposure  distribution  in 
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the  control  population,  one  is  able  to  derive  the  exposure  distribution  in  the 
case  population.  Following  are  the  special  cases  of  the  above  two  Lemmas  when 
the  distribution  of  the  exposure  variable  in  the  control  population  belongs  to 
generalized  exponential  family,  as  given  in  (2.3). 

Lemma  3. 

pr(A-  = o|zl-|^)  = 6XP  ^°{Si)  + ^ Zij  + - Wy) }]  , (2-7) 

where  <9*-  = % + ^1/32- 
Proof: 

pr(Aj  = 1| Zi:j,  Si)  = j exp{/?0(S))  + Zt]  + p2Xij}pr(Dij  = 0|X^,  Zi3,  S{ ) 

x p(Xij  | ij j dXij 

= exp{p0(Si)  + $ Z^}  J exp  (p2Xij)p(Xij , DtJ  = 0| Zijt  Si)dXZ] 

= exp {/30(Si)  + Pi  Zij}pi(Dij  = 0| Z^,  Si) 

x J exp ( p2 XVJ ) p ( Xij | Djj  — O'Zij'SJdXij. 

Result  (2.7)  is  important  because  it  means  that  prospectively,  the  marginal 
(over  X)  probability  of  an  event  is  logistic,  and  hence  that  standard  conditional 
logistic  regression  calculations  can  be  employed. 

Lemma  4. 

p(Xij\Dij  = 1,  Z^,  Si)  = exp [£ij{0*jXij  - b(6*j)}  + c(Xij,^j)\.  (2.8) 


Proof: 


p(Xij\Dij  I?  Zij,  Si) 


p(Xij,  Dij  1 1 X ij , S, ) 

pr(JDij  l\Zij,  Si) 


pr ( Dtj  1 1 XLj . Z^ , Sp  x p ( Xij \ Z ij , Si) 

pr  (D^  = 1|  Z^,  Si) 


cxp(02Xi:i ) pr  (Du  0 , X^  | Z ^ , Sp 

pr  {Dij  = 0|  Zij,  Si)  exp[^{6(%  + ^P2)  - H%)}] 
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= exp(^2^)pr(Aj  - 0| Zjj,  S^pjXi^Djj  = 0,  Zy,  Sj) 
pr (Di:j  = 0| ZijtSi)  exp[£ij{b(9*j)  - 6(0y)}] 

= ^P ~ b (0*  )}  + c(Xij,£ij)]. 


I assume,  without  loss  of  generality,  D.n  — 1,  Di2  = 0,  • • • , Am+i  = 0, 
for  each  stratum  i.  Note  that  I only  need  to  define  a probability  model  for 
[Dij\Xij,  Zij,  Si],  and  [Ay \ Dy  — 0 , Zy,Si]  to  arrive  at  the  joint  conditional 
likelihood  of  [ Aj , Xij \ , $,  Y^jLi  D ij  — 1>  Ay].  The  expression  for  this  likelihood 
is  based  on  the  following  key  factorization: 
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A oc  pr{(Ai  = 1,  Aj  = 0 ,j  + 1),  X^,  Zi:j,  Ay,  ^ Aj  = 1} 


i= 1 
n 


3= 1 


M+l 


- Za,  Ai  - 1)  n /'A,j(A/S,.  Zy,  Aj  = 0)} 

j=2 


i=l 


A pr(Ai  = 1|$,  Za)  nS1  Pr(Aj  = 0|5j,  Zy) 

x II 


L=t  TSS1  Pr(A(  = 1|$,  Z«)  nf#r  P^Aj  = 01$,  Zy) 


M+l  . 
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- nAAA^i$, Za, Ai = i)  n pAij(a,i$, Zy, Aj = o)} 

i=2 


i=l 


x TT  Pr(Ai  — 1|$,  Zji)/pr(Ai  = 0]$)  Zq)  g. 

fi  E^1  Pr(Az  = 1|$,  Z«)/pr(A«  = 01$,  Za) ' 

One  may  also  note  that  there  is  an  underlying  asymmetry  in  the  model  in  terms 
of  treating  the  partially  missing  covariate  X and  completely  observed  covariate 
Z which  I assume  to  be  nonstochastic.  One  could  model  the  completely  observed 
covariate  Z in  a retrospective  fashion  as  well  by  assuming  a parametric  distribution 
for  p(Zij\Dij  — 0,  Si).  Adopting  that  model  will  lead  to 


pr(Aj  = l|$)/pr(Aj  = 0|$) 


Pr(Aj  1 1 Zi  ij , Si) 
Pr(Aj  = 0|Zy,5i) 


x p(Zl3 1 -Aj 


0,  Si)dZij. 


This  imposes  assumptions  on  the  unconditional  disease  prevalence  in  each  stratum 
which  is  harder  to  justify  in  comparison  to  (2.5).  More  importantly,  if  that 
specified  model  for  p(Zy|Aj  = 0,$)  is  incorrect,  one  risks  the  possibility  of 
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biased,  and  inconsistent  estimation.  Since  Z is  usually  multivariate  with  a mixture 
of  categorical  and  continuous  covariates,  specifying  an  accurate  model  for  its 
distribution  may  not  be  an  easy  task. 

Using  (2.5)- (2.7),  the  conditional  likelihood  is  given  by 


-kc(A,/?2,7,  701,702,  • • • , 70n) 

ex  | f[^Xa\ZatSi,Da  U)x[]  pAij  (Xij\Zi:j,  Si,  Dij  = 0) 
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expELi  A Zg  + i ~ &(fti)}] 

ntitES1  exp^fZ^-  + &{6(0£)  - &(%)}] 


- []  exp^iAai^Xa  - &(*£)}  + Aac^.ea)] 
2=1 


n M-fl 

x n n exp[Aij6i{^^  - b(eij)} + &iAxij>&j)} 

*= 1 3=2 


><n 


' M+l 
1 + ^ exp 
v 1=2 


^i)+e«{6(^)-6(^-)} 
-&1  &(*«)}! ) 


(2.10) 


The  main  problem  in  (2.10)  is  to  estimate  the  regression  parameters  (31  and 
/?2-  However,  it  may  be  noted  that  while  the  above  conditional  likelihood  has 
eliminated  the  nuisance  parameters  0o(Si),  the  nuisance  parameters  701  remain 
(via  &ij),  and  they  increase  in  direct  proportion  to  the  number  of  strata.  Hence, 
even  the  conditional  maximum  likelihood  of  /31  or  02  is  potentially  subject  to  the 
Neyman-Scott  phenomenon,  and  may  produces  inconsistent  estimators.  Adopting  a 
Bayesian  approach  is  one  way  to  circumvent  this  difficulty. 

I consider  the  following  set  of  mutually  independent  priors,  0i  ~ Normal(/x/?i , 
S/jJ,  02  ~ Normal^^jcr^),  7 j ~ Normal(/i*,  <7*),  and  7oi|G  ~ G,  where  G has  a 
Dirichlet  process  prior  with  a normal  base  measure.  Symbolically,  G ~ DP(aG 0) 
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and  Go  is  N((o,ctq),  where  a is  the  concentration  parameter.  Using  the  result 
of  Antoniak  (1974),  it  follows  that  the  predictive  distribution  of  70^i  given 
701 ) ' ' • , 7o n is 

a 1 " 

70n+ll701i  • ' • .70n  ~ T~ GT^sCo^o2)  + — Ao,(-)-  (2-11) 

Lx  I I L Lx  I / U 

k=l 

Remark  1.  Note  that  for  very  large  values  of  a relative  to  n,  the  Bayesian  semipara- 
metric  method  produces  essentially  the  N((0,al)  predictive  distribution  for  70^+i, 
while  very  small  values  of  a relative  to  n amounts  to  the  fact  that  the  predictive 
distribution  of  70Tppr  essentially  equals  the  “empirical”  distribution  of  701,  ■ • ■ ,7on- 

One  may  also  note  that  as  a — » 00  the  Dirichlet  process  model  reduces  to 
specifying  a parametric  model  on  the  70*,  namely  y0i  ~ Go,  whereas  a = 0 simply 
implies  a parametric  model  with  a constant  stratum  effect,  namely,  ^0i  — 70  and 
7o  ~ Go- 

In  my  computations,  I assume  a Gamma  prior  on  a and  resample  from  the  full 
conditional  distribution  of  a using  a latent  beta  variable  as  described  in  Escobar 
and  West  (1995). 

Let 


ga  = exp [pliZij  - Za)  + Zij {>>{()* j)  - 6(0*)}  - 6r{&(0a)  - 6(0(  1)}]-  (2-12) 


Now  the  full  conditional  distribution  of  the  parameters  may  be  obtained  as: 

n M+ 1 ^ 
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(2.14) 
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n M- 1-1 


a exp[^  - 6(6>*1)}]  x exp[^  A ^{0^  - &(%)}] 


x (2A5) 

, = 1 ->'=9  Zcr* 


i=  1 j=2 
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7r(7oi|-)  oc 


a 


a + n — 1 


1 


n 


a + n — 1 


where  / is  the  indicator  function  and  W\  and  w2  are  two  weight  functions  which 
can  be  calculated  explicitly  and  are  such  that  7r(7oi|-)  is  a proper  density.  Thus 
expression  (2.16)  may  be  derived  using  the  classical  results  of  Antoniak  (1974).  We 
also  note  that  and  9*j  involve  7 7 

For  estimation  of  parameters  I use  a Markov  chain  Monte  Carlo  computing 
scheme.  The  simulation  strategy  is  discussed  in  detail  in  Section  2.6. 

Remark  2.  It  is  easy  to  extend  results  of  this  section  to  multiple  exposures 
Xij  — (Xiji,  ■ ■ ■ ,Xijq)T  where  the  are  mutually  independent,  each  having 
a distribution  belonging  to  the  exponential  family.  One  then  needs  to  introduce 
the  indicator  variables  A = 1 or  0 according  as  X ^ is  observed  or  missing.  The 
likelihood  given  in  (8)  then  needs  to  be  suitably  modified.  The  extension  to  the 
case  when  these  multiple  exposures  may  possibly  be  associated  is  more  involved  as 
one  then  need  to  model  the  association  structure  in  a suitable  way.  These  issues  for 
multivariate  exposures  are  considered  in  Chapter  5. 


Note  that  none  of  the  full  conditionals  will  has  a standard  distributional  form, 
and  thus  simulating  observations  from  the  conditionals  is  not  automatic.  I adopt  a 
componentwise  Metropolis-Hastings  algorithm  for  each  of  the  parameters,  namely 


2.4  Computational  Details 
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0 1,  02,  7-  For  simulating  observations  from  the  posterior  of  70 i I use  an  algorithm 
suggested  by  Neal  (2000).  The  details  of  the  computational  steps  are  as  follows. 


For  parameters  other  than  70 i,  i = 1, ..,  n I essentially  follow  the  same 
Metropolis-Hastings  scheme.  Let  9 stand  for  the  generic  parameter,  i.e. , any  of 
0i,  P2,  7-  Let  Lc(9 1-)  denote  the  conditional  likelihood  as  furnished  in  (2.10)as 
a function  of  9 given  the  data  and  all  other  parameters.  Let  tt(6)  be  the  prior 
distribution  on  9.  In  order  to  simulate  observations  from  the  full  conditional 
distribution  of  9 , namely  7r(0[-)  I follow  the  following  steps. 

Step  1.  Start  with  any  reasonable  initial  value  of  9,  say  90.  This  is  the  current 
value  for  9. 

Step  2.  Generate  a new  value  of  9,  say  9*  from  a candidate  density  g(6). 

Step  3.  Replace  90  by  9*  with  probability  p(9,9*)  — min-fl,  J }■ 

n{Oo\-)g{0*) 

Retain  the  existing  value  of  9q  otherwise. 

Remark  3.  I choose  the  candidate  density  g(9)  to  be  identical  to  the  prior  density 
7 :{9).  Note  that  the  full  conditional  density  of  9,  namely,  7r(#|-)  oc  n(9)Lc(9\-). 
Consequently,  the  acceptance  probability  as  described  in  Step  2 above,  reduces  to 
(after  cancelation  of  the  prior  term  with  the  identical  candidate  density  term), 


• X1  Lc(9*  \.) 

Once  I update  the  value  of  the  first  set  of  parameters,  say,  01,  I move  on  to  the 
similar  cycle  for  02  with  the  updated  value  of  01  substituted  in  the  likelihood  in 
(2.10).  After  updating  /32,  one  can  proceed  to  the  Metropolis-Hastings  step  for  7X 
with  the  updated  values  of  01  and  /32,  and  finally  with  all  three  updated  values  of 
0i,  02,  7i  to  the  iteration  steps  for  70i,  i — 1, ..,  n. 
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Note  that  the  convergence  of  this  Markov  chain  to  the  proper  conditional 
distribution  7r(0|-)  is  automatic  because  the  transition  kernel 

k(6,  r)  = p(9,  e*)7 r(0*)  + (1  - r{O))50(P),  (2.17) 

where  r(9)  = f (1  — p(9,  9*))ir(9*)d9*  and  5g  denotes  the  Dirac  mass  at  9 , satisfies 
the  detailed  balance  condition  (see  Robert  and  Casella,  2001  for  details)  i.e. , 

k{e,9*)n{9\-)  = k(9*,6)n{9*\-). 

Remark  4.  For  the  equine  example  in  Section  2.5.1  with  the  exposure  (age) 
distribution  modeled  normally  there  will  be  an  additional  scale  parameter  a2  on 
which  I will  assume  an  Inverse  Gamma  prior.  In  that  case  there  is  an  additional 
Metropolis  Hastings  step  for  a2  similar  to  that  described  above. 

Generating  observations  from  conditional  of  'fop  For  simulating  observations 
from  the  full  conditional  of  7oi,  i — 1,  ..,n,  first  note  that  this  will  be  substantially 
computation-intensive  as  n is  typically  large  in  many  of  the  real  examples.  I adopt 
one  of  the  algorithms  suggested  by  Neal  (2000)  for  generating  observations  from  the 
posterior  of  Dirichlet  mixtures. 

To  illustrate  the  computational  steps,  let  us  fix  our  attention  to  a particular 
stratum,  say,  stratum  k.  Also  let  7o(-fc)  = (7oi,  7o  k-i,  7o  fc+i,  "7on),  the  vector  of 
all  but  the  kth  stratum  effect  parameter.  I proceed  in  the  following  way: 

Step  1.  Start  with  some  reasonable  values  of  70* , i — 1,  ..,n.  The  current  state 
of  the  Markov  chain  consists  of  these  n values,  namely,  (701,  - - , 70* , 7o«)- 

Step  2.  Draw  a candidate  value  7^.  from  the  distribution  of  7ofc|7o(-fc),  namely, 
from, 

n 

(a  + n-  l)_1^/70.(-)  T{a/(Q  + n — 1)}N(-;  Co,  Co)- 

j/fc 

There  are  several  ways  of  generating  observations  from  the  above  mixture  distri- 
bution, which  essentially  reflects  that  for  the  k- th  stratum  effect  you  either  get 
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a new  distinct  value  from  the  normal  component  of  the  mixture  with  probability 
a/ {a  + n — 1),  otherwise  it  is  drawn  with  equal  probability  from  the  current  set 
of  (n  — 1)  entries  of  7o(-/t),  the  vector  corresponding  to  the  other  (n  — 1)  stratum 
effect  parameters.  Following  is  one  particular  way  to  generate  observations  from 
this  candidate  density. 

A)  Let  pj  = l/(a  + n — 1)  for  j = 1, n,  j ^ k and  pk  — a/(a  + n — 1). 

B)  Calculate  the  cumulative  probabilities  c;  = Y?j=i Pj-  Let  co  — 0- 

C)  Draw  a random  number  u from  U(0,1)  distribution.  Then  Cj_i  < u < Cj  for 
some  specific  j € {1,  ..n}.  If  j — k,  draw  the  candidate  7^.  from  N((0,(Tq).  If  j ^ k, 
take  7o  k = To  j- 

Step  3.  Compute  the  acceptance  probability  a(7ofc;7ofc)  = min{l,  Lc(7gfc|-)/Lc(7ofc|-)}, 
where  -Lc(7ofcl')>  an^  -^c(7o*|')  are  the  conditional  likelihoods  as  given  in  (2.10)  eval- 
uated at  7ofc  and  7 0fe,  respectively,  with  all  other  parameters  held  fixed  at  their 
current  values. 

Step  4.  Set  the  new  value  of  70*,  to  7^  with  the  above  probability. 

Step  5.  Repeat  Steps  2-4  R times.  I choose  R = 5 in  all  my  computations. 

Consider  the  last  of  these  R updates  of  7 0fe  as  the  current  value  of  the  A;-th  stratum 
effect  parameter.  I repeat  the  above  steps  for  all  n stratum  effect  parameters,  so  for 
the  equine  data,  there  will  be  498  such  cycles  to  be  performed,  one  for  each  of  the 
7o»,  * = 1,  -n. 

Remark  5.  For  resampling  from  the  full  conditional  distribution  of  a I follow  the 
algorithm  suggested  by  Escobar  and  West  (1995).  At  each  step  I count  the  number 
of  distinct  70 »,  say  k,  and  conditional  on  the  current  values  of  a,  and  k,  I simulate  a 
latent  beta  random  variable  say,  77  ~ B(a  + l,n).  Let  G(a,b ) denote  the  prior  on 
a.  Using  the  current  value  of  k,  and  77,  a is  simulated  from  the  following  mixture  of 
Gamma  distribution, 


7r(a|77,  k)  = pG(a  + k,b-  \og(r]))  + (1  - p)G(a  + k — l,b  — log(?7)) 
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where  p — (a  + k — 1 )/[a  + k — 1 + n{b  — log(?7)}]. 

One  complete  step  of  the  entire  Markov  chain  consists  of  the  above  compo- 
nentwise Metropolis-Hastings  steps  for  all  the  parameters  in  the  model.  I proceed 
iteratively  through  the  whole  cycle  for  all  the  parameters  until  convergence.  I 
run  the  chain  typically  from  7000-10000  iterations,  and  calculate  the  diagnostic 
proposed  by  Gelrnan  and  Rubin  (1992)  as  a measure  of  convergence.  I furnish  the 
posterior  means  as  my  estimates  for  the  parameters,  and  provide  as  well  the  highest 
posterior  density  credible  region  for  all  the  parameters.  The  codes  for  implementing 
the  methodology  are  available  on  my  homepage  http://www.stat.ufl.edu/~  ssinha. 

2.5  Two  Special  Cases  with  Data  Examples 
It  is  worth  noting  that  the  proposed  model  can  accommodate  both  continuous 
and  discrete  exposure  variables  as  well  as  modeling  for  the  missingness  in  the 
exposure  variable.  In  this  section,  I present  two  particular  illustrations,  one  with 
a normally  distributed  exposure  variable  and  the  other  with  a binary  exposure 
variable,  all  motivated  by  real  examples. 

2.5.1  Continuous  Exposure:  Equine  Epidemiology  Example 

The  first  example  is  the  equine  epidemiology  example  described  in  the  in- 
troduction. Age  was  considered  as  a single  exposure  variable  (X)  measured  on  a 
continuous  scale  with  one  binary  covariate  ( Z ) indicating  whether  the  horse  experi- 
enced recent  diet  changes  or  not.  For  scaling  purposes  I linearly  transformed  age  so 
that  X was  on  the  interval  [0, 1].  Since  I am  dealing  with  a single  covariate  and  one 
exposure  variable  here,  the  corresponding  regression  parameters  will  all  be  scalar. 

The  proposed  method  for  the  general  exponential  family,  as  described  in 
the  previous  section,  applies  here  with  continuous  exposure  X . I assume  that 
conditional  on  Dij  = 0,  i.e.,  for  the  control  population  the  exposure  variable  age 
has  a normal  distribution  of  the  following  form  : 


[XijlDij  = 0,  Zij,Si]  ~ Normal (70i  + 71 2^,  a2), 


(2.18) 
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where  70 ; are  stratum  effects  on  the  exposure  distribution  and  71  is  the  regression 
parameter  for  regressing  Xij  on  the  measured  covariates  Z.Lj . I also  assume  the 
logistic  regression  model  for  disease  status  (2.1).  From  (2.8)  it  follows  the  exposure 
distribution  among  the  cases  is  Normal  with  mean  7 0i  + 71  Zy  + a2/3 2 and  variance 
a2.  The  conditional  likelihood  analogous  to  (2.10)  can  now  be  derived  using  these 
facts.  I used  the  following  prior  distributions.  The  parameters  (/?i , /?2 j7i)  were 
independent  Normal(0, 10),  and  the  70j’s  had  a Dirichlet  prior  with  base  measure 
G0  ~ Normal(0, 10).  Experimentation  suggested  that  the  parameter  estimates 
remain  relatively  unchanged  over  a range  of  Gamma  priors  on  a.  Table  2-1 
contains  the  results  corresponding  to  a Gamma(2,  2)  prior  on  a.  The  variance  a2 
in  (2.18)  is  assigned  an  inverse  gamma  prior,  i.e.  n(a2)  oc  (cr2)~(C//2)-1exp(— ^2), 
written  symbolically  as  IG(c/2,  d/2):  for  my  example  c — 60  and  d — 2. 

The  full  conditional  distributions  for  all  the  parameters  have  compact  ex- 
pressions that  can  be  derived  as  special  cases  of  equations  (2.13)-(2.16).  The  full 
conditional  distribution  for  the  additional  scale  parameter  a2  is  given  by, 


For  estimation  of  the  parameters  I adopt  a Markov  chain  Monte  Carlo  numeri- 
cal integration  scheme.  For  simulating  observations  from  the  posterior  distribution 
of  70 i I follow  the  idea  of  Escobar  (1994),  West,  Muller  and  Escober  (1994),  Es- 
cober  and  West  (1995),  and  Neal  (2000).  For  generating  random  numbers  from 
the  conditional  distribution  of  other  parameters  I follow  a standard  componentwise 
Metropolis-Hastings  scheme  (Robert  and  Casella,  2001).  The  details  of  the  com- 
puting scheme  are  provided  in  Section  2.4.  For  each  of  my  examples  and  simulated 
datasets,  I conduct  three  analyses.  One  is  my  proposed  Bayesian  semiparametric 


(2.19) 


where  A = -2 (a  - f ),  /r  = a = ~\  E"=i{P^  1 ~ #u)2  + (Xi2  - <9i2)2}, 

b=-  PI 
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Equine  Data:  Variability  of  the  stratum  effects 


Figure  2-1:  Plot  of  the  variability  of  70i’s  for  the  equine  data  example 
Estimates  of  498  7oi’s  were  collected  for  each  of  the  last  5000  MCMC  samples. 
Variance  of  these  498  values  were  then  calculated  for  each  run.  The  histogram  is  of 
these  3000  variance  values  with  a kernel  density  estimate  overlayed  on  it. 

analysis  (BSP).  The  other  is  a parametric  Bayesian  (PB)  version  of  the  work  of 
Satten  and  Carroll  (2000)  by  considering  a constant  stratum  effect  7 0i  = 70  and 
then  assuming  a normal  prior  on  this  common  stratum  effect  parameter  70.  The 
parametric  Bayesian  method  is  simply  the  Bayesian  analog  of  the  method  of  Satten 
and  Carroll  (2000).  Also,  note  that  the  PB  method  is  a special  case  of  BSP  method 
with  a equal  to  zero. 
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I also  present  the  frequentist  alternative  for  analyzing  this  matched  data 
through  a standard  conditional  logistic  regression  (CLR)  analysis  (Breslow  and 
Day,  1980). 

I ran  the  three  analyses  mentioned  above  on  the  equine  dataset  with  all  498 
strata.  The  three  analysis  were  reran  where  I randomly  deleted  40%  of  the  age 
observations.  The  results  are  given  in  Table  2-1.  Briefly,  with  no  missing  data  the 


Figure  2-2:  Predictive  distribution  of  7o„+i  for  the  equine  data  example 
The  dashed  line  shows  the  predictive  distribution  of  7on+i  for  the  Equine  data  ex- 
ample overlayed  with  the  solid  line  for  the  distribution  function  of  the  base  measure 
N(0, 10). 

results  of  the  methods  are  roughly  in  accordance  with  one  another.  The  slightly 
small  standard  error  estimates  for  the  Bayesian  methods  may  be  a reflection  of  the 
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fact  that  unlike  CLR,  they  are  full  likelihood  based  semiparametric  and  parametric 
methods.  With  missing  data,  of  course,  one  sees  that  the  Bayesian  methods  yield 
much  smaller  standard  errors  than  CLR,  reflecting  the  fact  that  proposed  methods 
use  all  the  data,  and  not  just  the  pairs  with  no  missing  data.  There  is  little 
difference  between  the  semiparametric  analysis  and  the  parametric  Bayes  version 
of  the  method  of  Satten  and  Carroll  (2000),  a phenomenon  that  can  be  explained 
as  follows.  Figure  2-1  shows  a density  plot  of  the  variance  of  70i’s  across  the  498 
strata  in  the  last  5000  MCMC  samples.  The  average  variance  among  the  70; ’s  is 
very  small  (about  0.0054),  suggesting  that  the  stratum  effects  are  almost  constant 
across  strata.  Thus  the  parametric  model  with  constant  stratum  effect  essentially 
holds  for  this  example,  and  hence  it  is  not  surprising  that  the  two  methods  give 
similar  results.  Figure  2-2  presents  a plot  for  the  predictive  distribution  of  70tppt 
given  the  data  and  it’s  evident  from  this  figure  that  the  distribution  is  different 
from  simple  normal  model. 

2.5.2  Binary  Exposure:  Endometrial  Cancer 

In  the  Los  Angeles  1:4  matched  case-control  study  on  endometrial  cancer  as 
discussed  in  Breslow  and  Day  (1980),  the  cases  were  matched  with  four  controls 
on  the  basis  of  age,  the  community  the  case  lived  in,  marital  status,  and  the  time 
the  case  had  entered  in  the  community.  The  data  has  measurement  on  several 
binary  covariates:  presence  of  gall  bladder  disease,  hypertension,  use  of  estrogen,  to 
name  a few.  The  variable  obesity  may  be  considered  as  a binary  exposure  variable 
(. X ) for  contracting  endometrial  cancer.  This  variable  had  about  16%  missing 
observations. 

I considered  the  presence  of  Gall  Bladder  disease  in  the  subject  as  the  com- 
pletely observed  dichotomous  covariate  Z. 

In  such  a case  with  dichotomous  exposure  variable,  the  exposure  distribution 
for  control  population  is  naturally  assumed  to  have  a logistic  form:  pr (Xij  = 
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1| Zij)  — iJ(7oi  + 71  Zij),  where  H(-)  is  the  logistic  distribution  function.  From  (2.8) 
the  exposure  distribution  in  the  case  population  is  again  of  the  logistic  form  with: 

P r{Xij  = 1|  Z^)  — + 71 + fa)- 

In  the  LA  study  there  were  63  strata.  For  this  example  also  I adopted  the 
same  Metropolis  Hastings  scheme  by  Neal  (2000)  for  simulating  observations 
from  the  posterior  of  7oi’s.  Alternately,  one  can  use  the  approach  of  MacEachern 
and  Muller  (1998),  but  Neal’s  approach  seems  easier  to  implement  in  my  case.  A 
detailed  outline  of  the  algorithm  and  computing  scheme  is  already  given  in  Section 
2.4. 


Figure  2-3:  Plot  of  the  variability  of  70; ’s  for  the  LA  cancer  study  example 
Estimates  of  the  63  70i’s  were  collected  for  each  of  the  last  5000  MCMC  samples. 
Variance  of  these  63  values  were  then  calculated  for  each  run.  The  histogram  is  of 
these  3000  variance  values  with  a kernel  density  estimate  overlayed  on  it. 
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I present  the  results  with  a Gamma  (2,  2)  prior  on  a.  I considered  a 
Normal(0, 10)  prior  on  all  the  parameters  as  well  as  the  base  measure  in  the 
Dirichlet  process  prior. 

The  original  data  contained  missing  observations  on  the  exposure  variable 
obesity.  I modeled  the  missingness  in  the  exposure  variable  in  both  the  parametric 
Bayes  (the  Bayesian  version  of  the  Satten  and  Carroll  method  with  fixed  stratum 
effects)  and  the  semiparametric  Bayes  analysis.  Classical  conditional  logistic 
analysis  is  also  conducted.  Table  2-2  presents  the  results.  Although  there  are 
certainly  some  interesting  numerical  differences,  in  the  main  the  results  are 
reasonably  comparable.  Figure  2-3  presents  a density  plot  of  the  variance  of  the 
63  stratum  effects  in  the  last  5000  MCMC  samples  with  the  BSP  method.  The 
average  variance  among  the  70i’s  is  approximately  4.5.  This  is  fairly  large,  and  if 
there  were  major  covariate  effects  one  would  expect  the  BSP  and  the  parametric 
Bayes  version  of  the  method  of  Satten  and  Carroll  (2000)  to  be  different.  However, 
in  this  example  the  analyses  suggest  that  /?2  — 0 is  plausible,  so  that  D and  X are 
essentially  independent  and  whether  the  7oi’s  are  constant  or  not  does  not  cause 
a model  bias.  This  is  an  example  where  there  is  stratum  variability  and  natural 
missingness  in  the  data.  Figure  2-4  presents  a plot  of  the  predictive  distribution  of 
ToiPFT- 

2.6  Simulation  Study 

In  order  to  simulate  a realistic  dataset  for  comparing  the  Bayesian  semipara- 
metric method  with  the  parametric  Bayes  version  of  the  method  of  Satten  and 
Carroll  (2000)  and  standard  matched  conditional  logistic  regression,  I used  the 
equine  data  itself  as  a prototype,  except  that  I reversed  the  roles  of  diet  change  and 
age.  I generated  a hypothetical  1:1  matched  dataset  with  50  strata  and  with  now 
diet  change  (a  binary  variable)  as  the  exposure  ( X ) and  age  of  horse  (transformed 
on  to  [0,1])  as  an  observed  covariate  ( Z ).  To  elicit  a realistic  value  for  the  true 
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Figure  2-4:  Predictive  distribution  of  7on+i  for  the  LA  cancer  study  example 
The  dashed  line  shows  the  predictive  distribution  of  7ora+i  for  the  Los  Angeles  Can- 
cer Study  example  overlayed  with  the  solid  line  for  the  distribution  function  of  the 
base  measure  N(0, 10). 


parameters  (/?i,/?2)7i)  I made  use  of  the  original  equine  dataset  at  hand.  I first 
performed  a conditional  logistic  regression  analysis  of  the  full  equine  data  with  X 
and  Z as  covariates.  The  parameter  estimates  for  the  coefficients  of  X and  Z using 
the  CLR  analysis  were  chosen  as  the  true  values  of  f3\  and  /?2  in  my  simulation: 
these  were  2.049,  and  2.129,  respectively.  In  order  to  elicit  values  for  71;  I ran  a 
logistic  regression  with  diet  change  X as  the  response  and  age  Z as  the  covariate. 
The  fitted  equation  had  logits  —3.66  + 0.63 Z.  In  the  simulation  study  I thus  used 
7i  = 0.63.  The  purpose  of  the  simulation  was  to  explore  the  relative  performance 


44 


of  the  three  methods  when  in  fact  70 » = 70,  the  standard  model  assumptions  were 
true,  and  also  for  the  situation  when  the  constant  stratum  effect  assumption  does 
not  hold  and  70*  came  from  a distribution  with  reasonable  variability. 

I followed  the  structure  of  my  models  as  described  in  Section  2 for  simulating 
X,  D,  and  Z.  I started  first  with  generating  the  covariate  Z.  For  the  equine  data, 
the  histogram  of  the  variable  horse  age  resembled  a distribution  which  could  well 
be  described  by  a Gamma  distribution  with  shape  and  scale  parameters  1.8  and 
0.2  respectively.  Accordingly,  I generated  covariate  Z from  the  same  Gamma 
distribution  and  then  scaled  the  generated  values  to  [0,1].  The  results  remain 
similar  if  one  generates  Z from  a normal  distribution. 

Second,  I generated  the  binary  variable  D standing  for  disease  status.  For  the 
i-th  stratum,  it  may  easily  be  noted  that, 


Pr(Dn  — 1| Du  + Di2  — 1,  Zn,  Zi2,  Si)  — 1 — 


Qi2 

Qi2  + exp  [f3i(Zn  — Zi2)]Qa 


where,  for  j — 1, 2, 

Q = 1 + exp(7oi  + 71  Zjj  + P2) 

13  1 + exp(7oi  +71%) 

A Bernoulli  variable  was  generated  with  the  above  specified  probability  structure 
and  conditional  on  the  fact  that  the  simulated  value  for  Du  is  a 1,  i.e. , the  subject 
is  a member  of  the  case  population,  I generated  the  binary  exposure  mimicking  diet 
change  (A)  from  a Bernoulli  distribution  with  logit  (p^  — 70  i + 71  Zn  + (32\  if  the 
simulated  value  for  Du  is  0,  i.e.,  the  subject  is  a member  of  the  control  population, 
I generate  X from  a Bernoulli  distribution  with  logit(pi)  = 70;  + 7iAi-  Once 
we  have  simulated  DiX  in  the  above  manner,  for  the  other  observation  in  the  i-th 
stratum,  Di2  = 1 — Du.  The  corresponding  exposure  value  Xi2  is  then  simulated 
from  a Bernoulli  distribution  with  either  logit(pi ) = 70;  + 71^2  (when  Di2  = 0)  or 
logit(pi ) = 70 i + 71^2  + /?2  (when  Da  = 1). 
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I performed  two  sets  of  simulations,  one  with  a constant  value  of  70*  namely, 
-3.66  (this  choice  is  elicited  above)  the  other  with  70;  simulated  from  a normal 
distribution  with  mean  -3.66  and  standard  deviation  2.  The  results  are  presented  in 
Table  2-4. 

In  this  simulations  I used  identical  parameters  for  the  normal  distribution 
with  mean  zero  and  variance  10,  which  was  assumed  to  be  the  prior  on  70  in  the 
parametric  Bayes  version  of  the  method  of  Satten  and  Carroll  (2000)  and  also  as 
the  base  measure  in  the  Dirichlet  Process  prior  for  the  semiparametric  Bayesian 
method.  I used  a Gamma  (2,  2)  prior  on  a.  I replicated  the  simulations  50  times, 
generating  50  different  datasets,  and  obtained  the  parameter  estimates  by  the  three 
methods,  and  computed  the  mean  and  mean  squared  error  of  these  50  estimates. 

For  each  replication  I also  generated  a dataset  with  30%  observations  missing 
completely  at  random  and  recalculated  all  the  estimates. 

The  results  are  fairly  clear.  In  the  case  of  fixed  strata  effects  and  no  missing 
data,  there  are  little  in  the  way  of  differences  among  the  Bayesian  semiparametric 
method,  the  parametric  Bayesian  counterpart  to  the  frequentist  method  of  Satten 
and  Carroll,  and  ordinary  conditional  logistic  regression  (CLR).  However,  if  there  is 
30%  missing  data,  then  ordinary  CLR  is  clearly  less  efficient.  This  might  be  looked 
upon  as  a quantification  of  robustness  of  efficiency  for  the  Bayesian  semiparametric 
method.  On  the  other  hand,  if  there  are  major  unexplained  stratum  effects,  i.e. , 
the  70 i have  considerable  variability,  then  one  can  see  that  the  methods  separate, 
with  the  Bayesian  semiparametric  method  clearly  dominating  in  terms  of  mean 
squared  error. 

2.7  Concluding  Remarks 

This  Chapter  considers  matched  case-control  studies  with  missing  exposure 
data.  Parametric  analyses  require  handling  stratum  effects  on  the  distribution  of 
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the  exposure  distribution.  Previous  methods  either  ignore  such  stratum  effects,  or 
assume  that  they  can  be  captured  in  their  entirety  by  a parametric  model. 

I have  presented  a Bayesian  semiparametric  approach  to  model  the  stratum 
effects.  The  methods  proposed  can  handle  discrete  as  well  as  continuous  exposure, 
and  also  account  for  possible  missingness  in  the  exposure  variable.  The  methodol- 
ogy is  illustrated  through  real  examples.  Results  based  on  simulated  data  indicate 
that  in  the  presence  of  varying  stratum  effects,  the  semiparametric  Bayesian 
method  outperforms  the  parametric  Bayesian  method  and  frequentist  methods 
currently  in  the  literature. 
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Table  2-1:  Results  of  the  equine  data  example 


Full  Equine  Data 

Method 

ft 

ft 

10  x 71 

10  X <72 

Bayes  Semiparametric 

Mean 

2.16 

2.18 

0.18 

0.23 

s.e. 

0.33 

0.39 

0.14 

0.01 

Lower  HPD 

1.57 

1.36 

-0.10 

0.21 

Upper  HPD 

2.80 

2.88 

0.40 

0.25 

Bayes  Parametric 

Mean 

2.14 

2.10 

0.22 

0.26 

s.e. 

0.32 

0.41 

0.17 

0.01 

Lower  HPD 

1.60 

1.32 

-0.20 

0.24 

Upper  HPD 

2.87 

2.88 

0.60 

0.28 

CLR 

Mean 

2.13 

2.05 

s.e. 

0.32 

0.47 

Lower  CL 

1.50 

1.13 

Upper  CL 

2.76 

2.97 

Equine  Data  With  40%  Missing  Data 

Method 

Pi 

ft 

10  x 7! 

10  x a2 

Bayes  Semiparametric 

Mean 

2.16 

2.30 

0.14 

0.23 

s.e. 

0.32 

0.48 

0.24 

0.01 

Lower  HPD 

1.70 

1.30 

-0.25 

0.19 

Upper  HPD 

2.80 

3.42 

0.60 

0.25 

Bayes  Parametric 

Mean 

2.13 

1.90 

0.17 

0.28 

s.e. 

0.32 

0.47 

0.20 

0.02 

Lower  HPD 

1.56 

0.92 

-0.20 

0.24 

Upper  HPD 

2.85 

2.85 

0.67 

0.31 

CLR 

Mean 

2.81 

2.79 

s.e. 

0.59 

0.77 

Lower  CL 

1.65 

1.47 

Upper  CL 

3.96 

4.49 

Here  “Mean”  is  the  posterior  mean,  “s.e.”  is  the  posterior  standard  deviation, 
“Lower  HPD”  and  “Upper  HPD”  are  the  lower  and  upper  end  of  the  I4PD  region 
respectively,  “Lower  CL”  and  “Upper  CL”  are  the  lower  and  upper  end  of  the 
confidence  limit  respectively,  The  parametric  Bayesian  method  is  the  Bayesian 
version  of  the  method  of  Satten  and  Carroll  (2000). 
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Table  2-2:  Results  of  the  endometrial  cancer  data  example 


Gall  Bladder 

Obesity 

Method 

Pi 

Pi 

7i 

Bayes  Semiparametric 

Mean 

1.29 

0.70 

-0.15 

s.e. 

0.39 

0.41 

0.49 

Lower  HPD 

0.54 

-0.10 

-0.99 

Upper  HPD 

2.18 

1.49 

0.99 

Bayes  Parametric 

Mean 

1.19 

0.49 

0.42 

s.e. 

0.39 

0.36 

0.62 

Lower  HPD 

0.58 

-0.15 

-0.70 

Upper  HPD 

2.12 

1.25 

1.78 

CLR 

Mean 

1.28 

0.44 

s.e. 

0.39 

0.38 

Lower  CL 

0.52 

-0.30 

Upper  CL 

2.04 

1.19 

Here  “Mean”  is  the  posterior  mean,  “s.e.”  is  the  posterior  standard  deviation, 
“Lower  HPD”  and  “Upper  HPD”  are  the  lower  and  upper  end  of  the  HPD  region 
respectively,  “Lower  CL”  and  “Upper  CL”  are  the  lower  and  upper  end  of  the 
confidence  limit  respectively,  The  parametric  Bayesian  method  is  the  Bayesian 
version  of  the  method  of  Satten  and  Carroll  (2000). 
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Table  2-3:  Results  of  the  simulation  study  for  full  data 


Full  data, 

fixed  70^  = —3.66 

Method 

Pi 

P2 

7i 

Bayes  Semiparametric 

Mean 

1.9880 

2.0949 

0.6938 

MSE 

0.7595 

0.5324 

1.0342 

Bayes  Parametric 

Mean 

2.0089 

1.8183 

0.6743 

MSE 

0.7785 

0.5159 

0.6454 

CLR 

Mean 

2.3157 

1.9496 

MSE 

2.0505 

0.4603 

Full  data,  varying  70*  = 

Normal  ( 

-3.66,4) 

Method 

Pi 

P2 

7i 

Bayes  Semiparametric 

Mean 

1.7838 

2.2194 

0.4729 

MSE 

0.7126 

0.5042 

0.8015 

Bayes  Parametric 

Mean 

1.7812 

1.4641 

0.4795 

MSE 

0.9550 

0.6488 

0.8333 

CLR 

Mean 

2.6350 

1.7747 

MSE 

2.2622 

0.5667 

Here  “Mean”  is  the  mean  of  the  50  estimates  corresponding  to  the  50  simulated 
datasets  while  MSE  is  the  mean  squared  error.  The  true  parameter  values  are 
Pi  = 2.049,  P2  — 2.129,  and  yj  = 0.63.  The  parametric  Bayesian  method  is  the 
Bayesian  version  of  the  method  of  Satten  and  Carroll  (2000). 
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Table  2-4:  Results  of  the  simulation  study  for  missing  data 


30%  missing  data,  fixed  70i  = 

-3.66 

Method 

Pi 

P2 

7i 

Bayes  Semiparametric 

Mean 

2.0022 

2.4606 

0.9617 

MSE 

0.7925 

0.8962 

1.0962 

Bayes  Parametric 

Mean 

2.0141 

1.8156 

0.6913 

MSE 

0.7905 

0.8561 

0.9125 

CLR 

Mean 

2.9713 

1.2689 

MSE 

5.6249 

1.3182 

30%  missing  data, 

varying  y0i  = Normal(— 3.66, 4) 

Method 

Pi 

P2 

7i 

Bayes  Semiparametric 

Mean 

1.7922 

2.0150 

0.4935 

MSE 

0.8095 

0.8152 

1.0444 

Bayes  Parametric 

Mean 

1.7535 

1.5081 

0.5737 

MSE 

0.9690 

0.9444 

1.3068 

CLR 

Mean 

3.6817 

1.2078 

MSE 

10.2676 

1.3744 

Here  “Mean”  is  the  mean  of  the  50  estimates  corresponding  to  the  50  simulated 
datasets  while  MSE  is  the  mean  squared  error.  The  true  parameter  values  are 
Pi  = 2.049,  p2  — 2.129,  and  — 0.63.  The  parametric  Bayesian  method  is  the 
Bayesian  version  of  the  method  of  Satten  and  Carroll  (2000). 


CHAPTER  3 

BAYESIAN  SEMIPARAMETRIC  MODELLING  FOR  MATCHED 
CASE-CONTROL  STUDIES  WITH  MULTIPLE  DISEASE  STATES 

3.1  Introduction 

This  Chapter  considers  matched  case-control  studies  with  multiple  disease 
states.  In  some  situations  it  is  natural  to  note  that  the  disease  state  might  have 
more  than  one  category,  i.e.,  one  may  have  subdivisions  within  the  “cases”.  For  ex- 
ample, for  patients  diagnosed  with  cancer,  they  may  have  cancer  of  stage-I,  stage-II 
or  stage-III  at  the  time  of  the  diagnosis  which  is  an  example  of  ordinal  disease  cat- 
egories. One  may  also  notice  nominal  categories  for  disease  states  such  as  patients 
classified  into  having  one  eye  or  both  eyes  damaged.  To  the  best  of  my  knowledge 
there  has  been  hardly  any  Bayesian,  and  very  little  frequentist  work  for  analyzing 
matched  data  when  one  has  multiple  disease  states.  Along  with  considering  poly- 
tomous  disease  states  I extend  the  typical  methodology  for  analyzing  case-control 
data  in  the  following  way.  To  deal  with  partially  missing  exposure  variable  in  this 
setup,  I work  with  full  conditional  likelihood  by  assuming  a prospective  model  of 
the  disease  status  and  a parametric  distribution  of  the  missing  exposure  variable 
in  the  control  population  given  some  observed  covariates  and  stratum  effect.  More 
specifically,  I try  to  capture  unmeasured  stratum  heterogeneity  through  the  distri- 
bution of  the  missing  exposure  variable.  As  a result,  mentioned  in  Chapter  2,  even 
the  conditional  likelihood  involves  a set  of  nuisance  parameters  due  which  grow 
in  direct  proportion  to  sample  size.  Therefore  to  handle  the  potential  problem  of 
inconsistency  of  the  paramer  estimates  I took  a semiparametric  Bayesian  approach 
by  using  a Dirichlet  process  prior  on  the  stratum  heterogeneity  parameters. 
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The  example  considered  in  this  Chapter  involves  a matched  case-control 
dataset  coming  from  a low-birth-weight  study  conducted  by  the  Baystate  Medical 
Center  in  Springfield,  Massachusetts.  The  dataset  is  discussed  in  Hosmer  and 
Lemeshow  (2000,  Section  1.6.2),  and  is  used  as  an  illustrative  example  of  analyzing 
a matched  case-control  study  in  Chapter  7 of  their  book.  Low-birth- weight,  defined 
as  birth  weight  less  than  2500  grams,  is  a cause  of  concern  for  a newborn  as  infant 
mortality,  and  birth  defect  rates  are  very  high  for  low-birth-weight  babies.  The 
data  was  matched  according  to  the  age  of  the  mother.  A woman’s  behavior  during 
pregnancy  (smoking  habits,  diet,  prenatal  care)  can  greatly  alter  the  chances  of 
carrying  the  baby  to  terms.  The  goal  of  the  study  was  to  determine  whether  these 
variables  were  “risk  factors”  in  the  clinical  population  served  by  Baystate  Medical 
Center.  In  particular  using  the  actual  birth  weight  observations  I divided  the 
cases,  namely,  the  low-birth-weight  babies  into  two  categories,  very  low  (weighing 
less  than  2000  gms),  and  low  (weighing  between  2000  to  2500  gms);  and  tried  to 
assess  the  impact  of  smoking  habits  of  mother  on  the  chance  of  falling  in  the  two 
low-birth-weight  categories.  Presence  of  uterine  irritability  in  mother,  and  mother’s 
weight  at  last  menstruation  period  were  considered  as  relevant  covariates.  It  was 
noted  that  smoking  mothers  had  a higher  relative  risk  of  having  a low-birth-weight 
child  when  compared  to  a nonsmoking  mother.  However,  the  risk  of  having  a very 
low-birth-weight  child  did  not  depend  on  smoking  significantly.  This  observation 
could  not  be  made  without  the  classification  of  the  data  into  multiple  low-birth- 
weight  groups,  illustrating  the  relevance  of  such  a type  of  analysis  in  certain 
situations. 

The  present  Chapter  intends  to  develop  an  approach  to  case-control  studies 
with  a multicategory  variable  D denoting  disease  states,  a completely  observed 
covariate  Z,  a vector  of  exposure  variables  or  risk  factors  X which  could  potentially 
contain  some  missing  observations.  The  exposure  variables  could  be  discrete  or 
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continuous,  and  I assume  that  X|D,  Z has  a distribution  coming  from  a exponential 
family  which  may  have  different  natural  parameters  across  strata.  I assume  a 
Dirichlet  process  with  a normal  base  measure  on  the  distribution  of  the  stratum 
effects  and  normal  priors  on  the  other  regression  parameters,  and  estimate  all  the 
parameters  via  a Markov  chain  Monte  Carlo  computing  scheme.  This  is  a major 
departure  from  the  usual  frequentist  as  well  as  the  Bayesian  approach  of  assuming 
that  the  distribution  of  exposure  variable  is  not  affected  by  any  stratum  effect 
except  through  the  measured  covariates.  This  last  assumption  may  not  hold  in 
many  situations.  For  example,  matching  for  cancer  patients  is  often  done  from 
their  family,  and  smoking  is  a natural  exposure  to  consider.  The  distribution  of  the 
exposure  may  depend  on  genetic  traits  in  the  family  which  may  affect  the  disease 
distribution  in  different  families  in  different  ways,  and  may  not  be  measurable  as 
a covariate.  The  present  Bayesian  approach  will  allow  us  to  model  stratum  effects 
on  the  exposure  distribution  for  highly  stratified  data,  and  account  for  missing 
exposure  information. 

The  outline  of  the  remaining  sections  is  as  follows.  In  Section  2,  I introduce 
notations  and  model  assumptions.  Section  3 contains  the  conditional  likelihood,  the 
priors  and  the  posterior  distributions.  In  Section  4,  I analyze  the  low-birth-weight 
data,  and  discuss  the  MCMC  computation  scheme.  Section  5 presents  the  results 
from  a small  simulation  study  comparing  the  proposed  Bayesian  semiparametric 
method  with  two  possible  parametric  Bayesian  alternatives.  Section  6 contains 
concluding  remarks. 

3.2  Model  and  Notation 

Let  Dij  denote  the  disease  state  of  the  j-th  individual  in  the  i-th  stratum 
S{.  Suppose  that  there  are  {K+ 1)  nominal  levels  of  the  disease  variable,  with 

= k denoting  disease  state  k,  k — 1, ..,  K,  and  D = 0 denoting  the  control 
group.  I assume  that  there  is  1 case  matched  with  M controls  in  each  stratum, 
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and  I have  n strata  in  all.  For  ease  of  notation,  I present  my  models  with  a single 
exposure  X,  but  the  model  could  easily  be  extended  to  the  case  when  one  has  a 
set  of  independent  multiple  exposures  each  having  a distribution  coming  from  the 
exponential  family.  The  disease  probabilities  for  each  of  the  K groups  are  modeled 
through  K logits  as  in  a multinomial  logistic  regression  model: 


Pr  [Dij  k | Si , Z ij , Xij  ] 

°gpr[Tly  = 0|Si,  Zij,Xij\ 


— Pok(Si)  + 0ik^ij  + PlkXij 


for  k — 1,  • • • , K. 

(3.1) 


Each  (3lk  is  a p x 1 column  vector,  and  Z%]  — (zjj\  ■■■  , Z\f)T  is  the  vector  of  p 
completely  observed  covariates. 

For  the  example  with  the  low-birth-weight  data  as  described  in  Section  3.1,  I 
have  two  disease  states  with  K — 2.  Then  exp(/?21)  signifies  the  odds  of  having  low- 
birth- weight  baby  for  a mother  who  smokes  relative  to  one  who  does  not  smoke, 
and  similarly  exp(/?22)  is  the  risk  of  a very  low-birth-weight  child  for  a mother  who 
smokes  relative  to  one  who  does  not.  The  conditional  probabilities  of  the  disease 
variable  given  the  covariate,  exposure,  and  the  stratum  are  given  by, 


p r ( IT j k j Si , Zjj,  Xij ) 


ex.p{(30k{Sj)  + f3TykZij  + p2  kXjj} 

1 + £f=l  exp^or^i)  + 0[TZ ^ + fcrXij} 


for  k = 1 , ■ ■ ■ , K. 

(3.2) 


and 


pr (Dij  = 0|5i,  Zij,Xij)  = 


1 + £f=i  exp{/30  r(Si)  + 0{rZ  ^ + p2TXij) 


(3.3) 


To  cover  both  discrete  as  well  as  continuous  exposures,  we  assume  a general  expo- 
nential family  of  distributions  for  the  exposure  variable  in  the  control  population 
with  respect  to  some  finite  dominating  measure  p,  i.e. , 


f(Xij\S1,  Z ij,  = 0)  = exp [€i:j{dijXij  - 6(%)}  + c(^-,  X^)}. 


(3.4) 
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The  natural  parameters  are  modeled  as  a regression  function  of  the  completely 
observed  covariates,  namely,  = 70;  + 7j  Zy,  where  'yf  = (711,  ■ ■ • ,7ip).  The 
dependence  of  the  exposure  distribution  on  the  stratum  is  captured  through  the 
varying  intercepts  70 

3.3  Likelihood,  Priors  and  Posteriors 

Before  writing  out  the  conditional  likelihood  one  needs  some  technical  results 
stated  in  the  following  as  Lemmas.  These  Lemmas  follow  by  repeating  essentially 
the  proofs  of  Lemmas  3 and  4 of  Chapter  2.  The  details  are  omitted. 

Lemma  1.  Under  assumptions  (3.2)-(3.4)  the  distribution  of  the  exposure  variable 
in  a given  disease  state  k , namely,  /(WyjS),  Zij , Dy  — k ) is  also  of  general 
exponential  form  with  scale  parameter  £,7 , and  natural  parameter  6*jk  = Oij  + ^~-1p2k 
for  k — 1,  ■ ■ ■ ,K. 

Lemma  2.  Under  the  same  set  of  assumptions, 

pr(i>- = (')  s'-Z-)  ^ exP{MSi)  + x exp[£ij{6(0Lfc)  - 6(%)}].  (3.5) 

I need  one  more  Lemma  to  write  out  the  conditional  likelihood. 

Lemma  3. 


pr(As  = k\Sj,  ZiS)/pr (Dis  = 0|5y  Zis) 

Ejt1  Pr(Aj  = *1  Su  Zy)/pr (Dzj  = 0|5j,  Zy) 

= MexP(^ZE  exp[^{6(^fc)  - 6(gfa)}]  for  — I . • • • . .17  tl.i-  1.  • • • .n. 
EjLi  exp(/3[fcZy)  exp[£is{6(0Lfc)  - &(%)}] 

(3.6) 


Without  loss  of  generality  one  may  assume  that  the  first  subject  in  each 
stratum  is  a case,  and  if  one  denotes  the  disease  state  (type  of  case)  in  stratum  i as 
ki  ( ki  could  assume  any  of  the  values  1,  then  the  conditional  likelihood  given 
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that  there  is  one  case  in  each  stratum  will  be  of  the  form: 

n 

Lc  oc  JJpr{Ai  = Dij  = 0(;  = 2,- • ■ ,M  + l),Xij(j  = 1,  ■ ■ ■ ,M  + l)|Sg 

M+l 

^ij  i ^ ^ Dij  ki  j- 


i=  1 


3= 1 


M+l 


= ^ n fiXiAs^z&Dy = o)} 

3= 2 


i=l 


^ pr(Ai  = h\Si,  Zg)  nS1  pr(Ai  = 0|Sg  Zy) 


n 


E^t1  pr(A ! = fci|5i,  za)  riX  pr(Ai  = 0|5i,  Z«) 


M+l  , 


M+l 


= n(/(^ii|5i,  Za,  Ai  - fci)  n f(xa\Si,  Zij,  Ai  = o)} 

3= 2 


2=1 


pr(Ai  = fci+2,  Zg)/pr(Ai  = 0++  Zg) 

=t  E^t1  Pr(A,-  - fci|5i,  Zjj)/pr(Aj  = 0|5g  Zy) ' 


n 


(3.7) 


In  case  of  partially  missing  exposure  variable  one  should  modify  the  above  likeli- 
hood in  the  following  way.  Let  <+  be  an  indicator  variable  for  missing  exposures 
defined  in  the  following  manner: 


Sij  = 


1 if  Xij  is  observed, 

0 if  is  missing,  i — 1,  • • • , n and  j = 1,  ■ • • , M + 1. 

I also  assume  that  the  distribution  of  does  not  involve  the  parameters  j3lk,  fok, 
7lt  and  7J  = (701,  • • • ,70 „).  The  conditional  likelihood  including  missingness  of  the 
exposure  variable  is  seen  to  be, 

n M+l 

A OC  HifiXnlSi,  Zg,  Ai  - h)&n  n /PA \Si,  z ij,  Dij  = 0)*+ 


i= 1 


n 


7=2 


Pr(Ai  = fcilA-  Zg)/pr(Ai  = O+i,  Zg) 


fi  Ejlt1  Pr(Aj  = Zy)/pr(Ai  - 0|5g  Z0-)  ■ 

This  likelihood  involves  the  parameters  /3+A  = (/3U,  ■ ■ • /3A  = ($21, 

7i,  and  7o- 


(3.8) 

+2  k), 
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I consider  mutually  independent  normal  priors:  (3lk  ~ Np(iipik,apikIp)  for 
k = 1,-  • • ,K,  /32  ~ NK(iJ,g2,al2IK),  and  7X  ~ iVp(^71,  Ip).  Following  are  the 
full  conditional  distributions  of  the  parameters. 


ex 


Plfc  


n”=1{ES1exp(^'/311^  + tf/312^)  X 


hi) 


(2) \ v (l+cxp(gf,Q) 


for  A:  = 1,2, 


(3.9) 


^(/32|-)  oc 


eXP(-2^^2^) 

02 


rrn  rv-\M+l  /i  T/q  ^(1)  i lT  zQ  >7(2)  \ w (H"exP  1 

rii=l{X/j=l  exp(^j  PllZ^j  4"  P12Z-  ) (1+exp  (0y))  J 

n 

nil  + exp^)}1-^,  (3.10) 

i=l 


7r(7i|-)  a 


exp(— ^r-C/71C/71) 


where 


t— rn  / » T /d  ^(l)  1 lT /o  /7(2)\  w (1+exp  (£T))  -. 

nj=l{E,=l  exp(^i  011^ij  + 0nZij  ) X (i+exp(0y))  / 

z — 1 j — ^ 


(fc) 


- ^lfc  ~ h01k  ~ alik  ]C  hiZi 
2=1 
n 

U02  = /32  — ^p2  — crp2  'y  ( 5iihiXn, 


i=i 

n M+l 


C77i  — 7i  ASi  cr7l  yi  fiijZijXij, 
i= 1 j=l 

and  hi  is  defined  as  hi  — (ha,  ■ ■ • , hix)T  where, 


(3.11) 


J 1 if  Du  — r 

hir  \ 

0 otherwise  , i — 1,  • ■ • , n and  r — 1,  ■ ■ ■ , K. 

7r(7o*|-)  oc  V — — rinfc(/311,/312,/32,7i,70s,s  ^ i)I^ok)Lc 

a + n — 1 

kpi 

+Wi{0n,0\2,02,'yi>'Yos,s  ^ i)/i(7oi|’)> 


(3.12) 
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where 


expb^-Co-^E^EjlT1 


/»(7o»|-)  oc 


2cr0 


&ij  Xij  ) 


niLJES1  eMhfflnZV  + hJ(312Z\f) 


(l+exp  -j 

X (l+exp  (0i3))J 


3— * 


2=1 


and  wr,  r = 1,  • • • , n are  weight  functions  such  that  7r(70j|-)  is  a proper  density.  One 
may  note  that  if  X is  completely  observed  Sij  — 1 Vi,  j. 

One  of  the  key  features  of  my  approach  is  to  retain  the  nuisance  parameters 
70 i in  the  model,  and  to  put  a Dirichlet  with  normal  mixture  prior  on  them,  i.e., 
7oi|C  ~ G,  where  G ~ DP(aG0 ) and  G0  is  A!(Co>°o)-  As  mentioned  in  Chapter  2 
and  using  Antoniak  (1974),  I have 


7oi|7ofc(&  7^  i) 


a 


a + n 


-N(-\Co,  °o)  + 


a + n~lk 


Ao  k ( ' ) 1 


(3.13) 


where  I is  the  indicator  function.  The  result  allows  the  possibility  of  equal  values 
of  some  7oi  as  well.  As  one  will  notice  in  my  simulations,  the  fact  that  it  can  allow 
for  equal  values  of  70 j,  plays  an  important  role  in  making  the  procedure  robust  over 
a wide  spectrum  of  scenarios,  from  widely  varying  70i’s  to  the  case  when  they  are 
all  equal. 

For  the  data  analysis,  and  simulations  I compared  the  Bayesian  semipara- 
metric  (BSP)  method  with  two  possible  parametric  Bayesian  alternatives.  The 
first  one  is  a parametric  Bayesian  analogue  of  the  method  proposed  by  Satten  and 
Carroll  (2000)  for  matched  case  control  studies  with  a single  disease  state.  Satten 
and  Carroll  (2000)  assumed  a constant  stratum  effect  on  the  exposure  distribution, 
i.e.,  7oi  = 70.  In  the  parametric  Bayesian  analogue  of  their  method  (denoted  by 
PBC,  C standing  for  constant  stratum  effect)  in  the  context  of  multiple  disease 
states,  I consider  a normal  distribution  as  a prior  on  this  common  stratum  effect 
parameter  70,  and  carry  out  Bayesian  analysis.  The  other  parametric  Bayesian 
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alternative  (denoted  by  PBV,  V standing  for  varying  stratum  effects)  allows  for 
possibly  varying  70*,  and  assumes  i.i.d.  normal  prior  on  each  70 *■ 

Remark  1.  It  follows  from  (3.13)  that  for  very  large  values  of  a the  BSP  method 
is  equivalent  to  the  PBV  method  whereas  PBC  is  a special  case  of  BSP  method 
with  a equals  to  zero.  In  my  numerical  work  I assumed  a Gamma  prior  on  a,  and 
resampled  from  the  full  conditional  distribution  of  a using  a latent  beta  variable  as 
prescribed  in  Escobar  and  West  (1995). 

Remark  2.  The  entire  analysis  carries  through  if  each  stratum  contains  varying 
number  of  controls,  with  equations  (3.5)-(3.9)  remaining  essentially  the  same  with 
M replaced  by  Mi  in  the  1-th  stratum. 

The  estimation  of  the  parameters  is  done  by  the  Markov  chain  Monte  Carlo 
numerical  integration  scheme.  To  generate  random  numbers  from  the  posterior 
distributions  of  the  parameters  I use  a componentwise  Metropolis  Hastings  scheme. 
The  computation  scheme  along  with  the  analysis  of  the  low  birth-weight  study  are 
described  in  the  following  section. 

3.4  Example  and  Computing  Scheme 

In  the  previous  sections  I discussed  the  general  methodology  which  I now 
apply  to  the  matched  case-control  study  for  low-birth-weight  data  as  described  in 
Section  1.  The  matched  data  contain  29  strata,  and  each  stratum  has  one  case, 
and  3 controls.  I denote  the  low-birth- weight,  and  the  very  low-birth- weight  group 
as  disease  state  1,  and  2 respectively.  One  can  possibly  think  of  many  different 
models  for  explaining  the  disease  in  terms  of  the  possible  covariates  recorded  in  the 
data  set.  I consider  smoking  status  of  mother  as  a single  exposure  variable.  Two 
other  covariates,  a binary  variable  denoting  presence  of  uterine  irritability  (UI)  in 
mother,  and  weight  of  the  mother  at  last  menstrual  period  (LWT)  are  also  included 
in  the  model. 
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As  a starting  point,  I separated  the  29  strata  into  two  groups  depending 
on  whether  the  case  belonged  to  low-birth-weight  category  1 or  2.  I formed  two 
cross-classification  tables  of  birth  weight  category  versus  smoking  status  of  mother 
during  pregnancy  period  for  these  two  separate  matched  samples,  and  noted  that, 
OR(1,0)  = 3.4,  and  OR(2,0)  = 1.917,  where  OR(fc,  0)  denotes  the  odds  ratio  of 
maternal  smoking  habits  for  birth  weight  group  = k versus  normal  birth  weight 
group  = 0.  The  two  odds  ratios  demonstrate  that  the  odds  of  having  a low-birth- 
weight  baby  for  a smoking  mother  as  opposed  to  a nonsmoking  mother  is  higher 
in  category  1,  whereas  for  category  2,  I notice  a relatively  weaker  behavior  in  the 
same  direction.  The  difference  in  the  odds  ratios  in  these  two  tables  led  us  to  use 
this  data  as  a testbed  example  to  illustrate  my  methods. 

For  the  proposed  analysis  I have  a stochastic  distribution  on  the  exposure 
variable  which  belongs  to  the  exponential  family.  The  binary  variable  smoking 
status  is  assumed  to  follow  a Bernoulli  distribution: 


f(Xij\Dij  = 0,Zij,Si)  = p?'3(l-pxj) 


\1— . 


(3.14) 


so  here  ^ = 1,  % = ln{ p^r),  &(%)  = M1  + exp(%)),  and  c(Xij,£ij)  - 0.  Also 
here  K — 2,  p — 2,  n — 29,  and  M — 3.  Using  Lemmas  1,  2,  and  3,  I obtain  the 
conditional  likelihood  for  the  whole  data: 


Lc  oc  jjfexp^Xi!  - Ml  + exP  Mi))} 


i=l 


M+l 


x exp [^{OijXij  - Ml  + exP  (%))}] 


1=2 


exD (hT 3 4-  hT  3 Z M x d+exp^qh 

exp Kni  Pile'll  + ni  ) x (l+exp (On)) 


EfT  ™p(Wn4‘>  + WkW)  x 


(l+exp(0t*)) 


(3.15) 


where  $*•  = dij  + hj (32-  Since  the  value  of  k (i.e. , disease  type)  is  completely 


determined  by  knowing  the  stratum  one  can  omit  the  subscript  k for  0*jk  in  the 
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above  expression;  Z^\  and  Z ^ denote  the  observed  value  of  the  two  covariates  UI, 
and  LWT  for  the  s-th  subject  in  the  i-th  stratum  respectively. 

Our  analysis  is  based  on  normal  priors  centered  at  zero  with  large  variances 
for  all  the  regression  parameters.  In  instances  (such  as  mine)  when  prior  elicitation 
is  not  possible,  these  priors  usually  lead  to  posteriors  relying  more  heavily  on  the 
data  and  protect  against  model  failures.  In  many  real  applications,  the  practitioner 
may  have  a more  precise  knowledge  about  the  sign  and  magnitude  of  the  relative 
risk  parameters,  and  can  suitably  change  the  prior  if  necessary.  We  conducted  a 
sensitivity  analysis  with  several  choices  of  prior  parameters.  For  the  regression 
parameters,  and  the  normal  base  measure  of  the  Dirichlet  process,  I experimented 
with  normal  priors  centered  at  zero,  and  with  variances  2,  4,  5,  6,  and  9.  I used  a 
Gamma  prior  on  the  concentration  parameter  a of  the  Dirichlet  process  and  ran 
my  analysis  with  both  shape,  and  size  parameter  set  at  0.5,  1,  2,  4,  10,  40,  100, 

200,  and  with  many  other  possible  priors  like  G(0.5,4),  G(l,  10),  G(10,  40),  G(100, 
40),  G(200,  10).  I noted  that  the  ultimate  numerical  estimates  are  reasonably 
stable  over  a varying  range  of  prior  parameters. 

The  results  are  reported  for  independent  N(0,5)  prior  on  each  component  of 
/3n,  /312,  and  /32,  N(0,6)  as  the  mixing  normal  distribution  for  the  Dirichlet  process 
prior,  and  a Gamma(2,  2)  prior  for  the  mixing  parameter  a. 

I used  componentwise  Metropolis  Hastings  algorithm  to  generate  random 
numbers  from  the  full  conditionals  of  the  parameters.  For  generating  observations 
from  the  posterior  distribution  of  j0i,  i = 1, . . . , n,  I used  an  algorithm  proposed 
by  Neal  (2000)  for  simulating  observations  from  posteriors  of  Dirichlet  mixtures  for 
nonconjugate  cases.  The  details  of  the  computation  scheme  are  given  in  Chapter  2. 

I ran  the  chain  typically  from  7000-10000  iterations,  and  the  inference  was  based  on 


the  last  3000  MCMC  runs. 
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Table  3-1  contains  the  posterior  means,  posterior  standard  deviations,  and 
95%  HPD  credible  intervals  for  the  parameters  of  interest  under  the  proposed 
Bayesian  Semiparametric  method  (BSP),  and  the  parametric  Bayes  (PBC  and 
PBV)  methods  as  discussed  before.  In  the  parametric  Bayes  methods  I used  a 
N(0,6)  prior  on  the  constant  y0  (in  PBC),  and  i.i.d.  N(0,6)  prior  on  varying  70i’s 
(in  PBV).  To  illustrate  the  methods  in  presence  of  missingness,  I deleted  40%  of 
exposure  values  completely  at  random  (in  the  sense  of  Little  and  Rubin,  1987)  and 
reran  the  three  analyses.  The  results  are  presented  in  Table  3-2. 

The  full  data  analysis  indicates  that  smoking  of  mother  is  a significant  risk 
factor  for  low-birth-weight  category  (category  1)  and  is  not  very  significant  in 
the  very  low-birth- weight  category  (category  2).  UI  on  the  other  hand  shows  an 
opposite  association,  showing  significance  in  category  2,  and  almost  no  significance 
in  category  1.  LWT  does  not  seem  to  be  a significant  covariate  in  any  of  the 
categories.  The  BSP,  and  the  PBV  methods  are  in  closer  agreement  whereas  the 
PBC  estimates  show  some  numerical  differences. 

Figure  3-1  shows  a plot  of  the  variance  of  the  29  stratum  effects  in  the  last 
3000  MCMC  samples.  The  average  variance  is  approximately  2.3  showing  that 
there  indeed  exists  variability  in  the  stratum  effects.  As  a result  the  BSP  and  PBV 
methods  which  account  for  this  variability  are  in  close  agreement,  whereas  the 
PBC  method  assuming  constant  stratum  effect  differs  numerically  from  these  two 
methods. 

For  the  analysis  with  40%  missing  observations  on  smoking  one  notices  that 
the  estimates  corresponding  to  smoking  in  the  BSP  method  comes  closer  to  their 
full  data  counterparts  even  though  the  inferences  are  same  in  all  three  methods. 

As  one  might  expect,  with  40%  missingness,  the  parameter  estimates  for  smoking 
lose  precision,  and  the  effect  of  smoking  now  appears  to  be  not  significant  in 
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both  categories  1 and  2.  Inferences  on  the  other  two  covariates  remain  essentially 
unchanged  when  compared  to  full  data  inferences. 


i 1 1 1 1 1 1 

1 2 3 4 5 6 7 

Low  Birth  Weight  Data:  Variability  of  the  Stratum  Effects 


Figure  3-1:  Plot  of  the  variability  of  7oi’s  for  the  low-birth- weight  study  example. 
Estimates  of  the  29  70,’s  were  collected  for  each  of  the  last  3000  MCMC  samples. 
Variances  of  these  29  values  were  then  calculated  for  each  run.  The  histogram  is  of 
these  3000  variance  values  with  a kernel  density  estimate  overlayed  on  it. 


I also  analyzed  the  data  after  collapsing  category  1 and  category  2 into 
only  one  category  (birthweight  less  than  2500  gms).  I carried  out  the  Bayesian 
analysis  for  a simple  matched  case-control  data  (Sinha  et  al.,  2003),  and  the  usual 
conditional  logistic  regression  (CLR)  analysis  (Breslow  and  Day,  1980).  Table 
3-3  shows  that  both  BSP  and  PBV  methods  assuming  varying  stratum  effects 
bring  out  the  effect  of  mother’s  smoking  on  having  low-birth-weight  newborns  and 
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produce  very  similar  results.  The  PBC  and  the  CLR  methods,  assuming  constant 
stratum  effect  are  in  closer  agreement  with  each  other,  and  they  do  demonstrate 
the  effect  of  smoking  but  not  as  precisely  as  the  other  two  methods  which  allow 
varying  stratum  effects.  Figure  3-1  again  demonstrates  the  differences  in  results 
between  these  two  classes  of  models.  Obviously,  without  the  finer  classification  into 
two  weight  categories,  the  fact  that  smoking  is  not  so  significant  for  category  2 and 
UI  is  appreciably  significant  for  category  2 cannot  be  concluded  from  looking  at  the 
overall  analysis.  Thus  the  multicategory  analysis  may  render  some  useful  additional 
information  to  the  practitioner. 

3.5  Simulation  Study 

In  the  low-birth-weight  data  one  can  notice  appreciable  variability  in  the 
stratum  effects.  In  practice,  the  experimenter  may  not  have  a prior  idea  about  the 
nature  of  variability  among  stratum  effects  and  there  could  be  situations  where  the 
standard  model  assumption  of  constant  stratum  effect,  i.e.  70 i = 70  hold.  Sinha 
et  al.  (2003)  contains  an  example  from  equine  epidemiology  where  the  stratum 
effects  have  very  small  variability.  I conducted  a simulation  study  to  ascertain 
the  robustness  of  the  BSP  method  even  when  variability  in  the  stratum  effects  is 
negligible. 

In  order  to  simulate  a realistic  dataset  for  comparing  the  BSP,  PBC,  and 
PBV  methods,  I decided  to  use  the  low-birth-weight  data  itself.  I generated 
a hypothetical  1:1  matched  data  set  with  50  strata  with  one  binary  exposure 
variable  (corresponding  to  X,  smoking  status  of  mother)  and  one  binary  co variate 
(corresponding  to  Z,  presence  of  uterine  irritability).  The  true  values  for  /?n,  P12, 
/32i,  and  P22  are  chosen  to  be  0.20,  1.80,  1.40,  and  0.4  respectively,  close  to  the 
estimates  obtained  by  analyzing  the  low-birth-weight  data  by  the  BSP  method  as 
presented  in  Table  3-1. 
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In  order  to  elicit  values  for  71,  the  coefficient  of  Z in  the  natural  parame- 
ter of  the  exposure  distribution  , namely  6^  on  Z I ran  a logistic  regression  of 
X(SMOKE)  on  Z(UI)  using  the  control  sample,  and  the  fitted  model  turned  out  to 
be  logit(pr(Xij ))  = —0.734  + 0.579  Ztj.  Accordingly,  in  the  simulation  study  I used 
71  = 0.60. 

I followed  the  structure  of  my  models  as  described  before  for  simulating  Z,  X, 
and  the  trinomial  variable  D,  indicating  the  birth  weight  category.  For  the  low- 
birth-weight  data  occurrence  of  uterine  irritability  was  reported  in  approximately 
20%  of  the  patients.  Thus  the  completely  observed  covariate  Z was  generated  first 
as  a Bernoulli  variable  with  success  probability  0.20  . Second,  we  generated  the 
trinomial  disease  variable  D.  For  the  ith  stratum,  one  should  note  that 


— 0\Za,  Zi2,  Du  + Di2  — 1 or  2,  Si) 


where,  for  j — 1,2, 


1 


n (a  v ,1  + exp  (%  + /?2i)  1 a r,  \ 1 + exP  i@ij  + A 

= 8xp(ftlZ„)  1 + exp(ftj)  + exp(ft2Z„)  1+expJ(%) 


I generated  a Bernoulli  random  variable  with  the  above  success  probability, 
and  if  this  variable  assumed  a value  1 , the  simulated  value  for  DiX  was  taken  to 
be  0 implying  that  the  first  subject  in  the  i-th  stratum  is  a member  of  the  control 
population.  Let,  for  k = 1,2, 


Pik 


l + exp[^(/312-/3n)]1+exp(e"+ter 


(3.17) 


I l+exp  (0ik+02i) 

If  Du  = 0,  I determine  Da  according  to  a Bernoulli  draw  with  success  probability 
p* 2 ; set  Di 2 — 1 if  it  results  in  a success  otherwise  set  Di2  — 2 . If  Du  ^ 0,  set 
Di 2 — 0,  and  I determine  the  value  DlX  according  to  a Bernoulli  draw  with  success 
probability  p*x;  if  this  results  in  a success  DiX  — 1,  otherwise  DiX  = 2. 
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Conditional  on  the  value  of  D,  I proceed  to  simulate  the  exposure  X.  If 
Dij  — 0,  we  generated  a binary  exposure  with  success  probability  given  by 
logit(pij)  = 70 i + 71  Zij.  If  D y = k,  I generate  the  binary  exposure  variable  with 
success  probability,  logit(pij ) = j0i  + 71  Ztj  + /32k,  j,  k = 1,2. 

I performed  two  sets  of  simulations,  one  with  a constant  value  of  70;  namely 
—1.00  the  other  with  a relatively  varying  set  of  70i’s  simulated  from  a normal 
distribution  with  mean  —0.5,  and  standard  deviation  1.5.  I assumed  IV(0,5) 
prior  on  all  the  relative  risk  parameters,  and  a Gamma(2,  2)  prior  for  a.  In  all 
my  simulations  I used  identical  parameters  for  the  normal  distribution  which  is 
assumed  to  be  prior  on  j0  in  PBC,  and  the  i.i.d.  prior  on  j0i  in  PBV,  and  also 
as  the  mixing  distribution  in  the  Dirichlet  Process  prior  for  BSP  (N( 0,  6)  in  this 
case).  We  replicated  the  simulation  50  times,  generating  50  different  datasets,  and 
obtained  the  parameter  estimates  by  above  mentioned  methods,  and  computed 
their  average  and  MSE.  For  each  replication  I also  generated  data  with  30% 
exposure  values  missing  completely  at  random  and  recalculated  all  the  estimates. 

The  simulation  results  presented  in  Tables  3-4  , and  3-5  illustrate  that  for 
constant  stratum  effect,  the  three  methods  are  comparable  with  PBV  estimates 
being  furthest  from  the  true  parameters  of  interest  whereas  for  varying  stratum 
effect  the  BSP  and  PBV  methods  have  a clear  edge  over  the  PBC  method.  Overall 
BSP  seems  to  be  the  more  robust  choice  as  at  the  onset  of  a study  one  is  not  sure 
about  the  nature  of  variability  in  the  stratum  effects. 

3.6  Concluding  Remarks 

In  this  Chapter  I proposed  a semiparametric  Bayesian  method  to  analyze 
matched  case-control  data  with  more  than  one  disease  state  and  illustrate  the 
methods  with  a real  example.  The  simulation  results  indicate  that  in  presence 
of  stratum  variability  and  missing  data  the  Bayesian  semiparametric  method 
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is  superior  to  the  parametric  Bayesian  alternatives.  All  three  methods  perform 
comparably  with  constant  stratum  effects. 

Our  proposed  model  considers  a nondeterministic  exposure  variable  having 
a probability  distribution  belonging  to  the  exponential  family.  Moreover,  the 
distribution  of  the  exposure  could  be  different  in  different  strata.  Our  model  takes 
into  account  both  discrete  and  continuous  exposure  along  with  possible  missingness 
in  the  exposure  variable.  The  growing  number  of  stratum  effect  parameters  are 
modeled  in  a semiparametric  Bayesian  way  to  overcome  the  inconsistency  problems 
arising  out  of  classical  analysis  of  such  matched  data.  The  computations  involving 
a Dirichlet  process  prior  with  a mixing  normal  distribution  are  done  through  a 
suitable  MCMC  scheme,  and  enable  us  to  obtain  estimates  of  the  parameters 
of  interest.  The  method  could  be  extended  to  multiple  exposures  having  an 
underlying  association  pattern  as  well.  The  general  framework  is  extremely 
flexible  for  being  used  in  unorthodox  data  situations  involving  missingness  and 
measurement  error  as  well  as  incorporating  widely  different  types  of  exposure 
variables  that  one  may  come  across  in  practice. 


Table  3-1:  Analysis  of  low-birth- weight  data  using  full  dataset  with  multiple  disease  states 
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Table  3-4:  Results  of  the  simulation  study  for  multiple  disease  states,  part-I 


Full  data,  fixed  70 i 

= -1.00 

Method 

P11 

P12 

P21 

P22 

7i 

BSP 

Mean 

0.23 

1.88 

1.44 

0 .40 

0.43 

MSE 

0.22 

6.57 

12.50 

6.55 

36.54 

PBC 

Mean 

0.19 

1.92 

1.43 

0.43 

0.51 

MSE 

0.57 

14.59 

14.42 

10.41 

20.26 

PBV 

Mean 

0.20 

1.88 

1.43 

0.32 

0.41 

MSE 

0.53 

10.56 

9.71 

15.50 

48.36 

30%  missing  data,  fixec 

7o  i = 

-1.00 

Method 

A 1 

P12 

P21 

P22 

7i 

BSP 

Mean 

0.20 

1.90 

1.45 

0.36 

0.44 

MSE 

0.17 

31.78 

50.07 

37.83 

81.31 

PBC 

Mean 

0.20 

1.89 

1.47 

0.46 

0.47 

MSE 

4.79 

10.05 

22.09 

11.33 

23.98 

PBV 

Mean 

0.20 

1.90 

1.35 

0.28 

0.38 

MSE 

0.56 

9.89 

57.05 

67.88 

83.19 

Here  ’’Mean”  is  the  simulated  mean,  while  MSE  is  the  mean  squared  error  xlOOO. 
The  true  parameter  values  are  fin  — 0.20,  fin  = 1.80,  fci  — 1-40,  P22  = 0.40,  and 
7i  = 0.60  . BSP  stands  for  Bayesian  semiparametric  method  whereas  PBC,  and 
PBV  stand  for  parametric  Bayes  methods  assuming  constant,  and  varying  stratum 

effects  respectively. 
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Table  3-5:  Results  of  the  simulation  study  for  multiple  disease  states,  part-II 


Full  data, 

varying  j0i  ~ 

N(- 0.5 

,2.25) 

Method 

Ai 

A 2 

Ai 

A2 

7i 

BSP 

Mean 

0.19 

1.89 

1.43 

0.42 

0.50 

MSE 

0.94 

8.20 

8.58 

10.56 

13.47 

PBC 

Mean 

0.20 

1.89 

1.52 

0.48 

0.53 

MSE 

0.70 

8.48 

27.98 

13.36 

97.75 

PBV 

Mean 

0.19 

1.92 

1.44 

0.43 

0.51 

MSE 

0.65 

12.84 

11.15 

9.07 

13.25 

30%  missing  data,  varying  y0i  ~ N(- 

-0.5,2.25) 

Method 

A i 

Pl2 

Ai 

A2 

7i 

BSP 

Mean 

0.20 

1.91 

1.50 

0.44 

0.50 

MSE 

0.59 

11.86 

16.96 

11.58 

19.56 

PBC 

Mean 

0.20 

1.92 

1.54 

0.49 

0.53 

MSE 

0.46 

14.04 

30.28 

14.68 

15.54 

PBV 

Mean 

0.20 

1.94 

1.49 

0.44 

0.50 

MSE 

0.76 

14.31 

16.15 

14.42 

19.35 

Here  ’’Mean”  is  the  simulated  mean,  while  MSE  is  the  mean  squared  error  xlOOO. 
The  true  parameter  values  are  Ai  = 0.20,  A2  = 1.80,  Ai  = 1.40,  A2  = 0.40,  and 
7i  = 0.60  . BSP  stands  for  Bayesian  semiparametric  method  whereas  PBC,  and 
PBV  stand  for  parametric  Bayes  methods  assuming  constant,  and  varying  stratum 

effects  respectively. 


CHAPTER  4 

SEMIPARAMETRIC  BAYESIAN  ANALYSIS  OF  MATCHED 
CASE-CONTROL  STUDIES  WHEN  EXPOSURE  VARIABLES 
ARE  MEASURED  WITH  ERROR 

4.1  Introduction 

Errors  in  measuring  exposures  can  be  an  important  source  of  bias  in  epidemio- 
logic studies.  In  many  epidemiologic  studies  exposure  variables  can  not  be  observed 
without  error,  for  instance,  measuring  a patient’s  blood  pressure,  measurement 
on  dietary  intake  etc.  As  a result,  the  estimated  odds  ratio  between  a disease  and 
the  potential  risk  factors  (relative  risk  for  rare  diseases)  is  subject  to  bias.  This 
Chapter  proposes  a method  to  handle  measurement  error  in  exposure  variables  in 
matched  case-control  study  when  the  exposure  variables  contain  partial  missing  ob- 
servations. In  the  likelihood  setup,  I assume  a distribution  of  the  exposure  variables 
in  control  population  given  a completely  observed  covariate  and  stratum  effect.  I 
also  assume  a prospective  model  of  the  disease  variable  in  terms  of  completely  ob- 
served covariate,  stratum  effect,  and  the  exposure  variables.  A simple  additive  error 
structure  is  assumed  for  the  measurement  error.  The  novelty  of  my  approach  is  to 
capture  the  unmeasured  stratum  heterogeneity  in  the  distribution  of  the  exposure 
variables  by  using  a Dirichlet  process  prior.  For  all  model  parameters  except  the 
stratum  effect  parameters,  I use  parametric  prior  and  the  they  are  estimated  in  a 
fully  Bayesian  framework. 

The  topic  of  measurement  error  has  drawn  considerable  attention  for  decades 
(Fuller,  1987  and  Carroll  et  ah,  1995).  For  case-control  problem,  a variety  of  meth- 
ods have  been  proposed  to  handle  measurement  error.  In  most  of  the  situations 
a fraction  of  case-control  data  provide  information  on  response  D,  and  a fallible 
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surrogate  T of  the  true  exposure  variable  X ; and  validation  data  consist  of  all  the 
three  variables  (D,X,T).  Prentice  (1982)  proposed  a method  to  estimate  relative 
risk  parameter  in  covariate  measurement  error  in  failure  time  data;  Carroll  and 
Wand  (1991)  semiparametrically  modeled  the  distribution  of  surrogate  T and  esti- 
mated the  parameters  of  logistic  regression  model.  Carroll  et  al.  (1993)  proposed 
a pseudolikelihood  method  to  handle  measurement  error  in  case-control  studies. 
They  also  provided  Prentice-Pyke  type  equivalence  of  prospective  and  retrospective 
analysis  of  case-control  studies  in  the  presence  of  measurement  error.  Armstrong  et 
al.  (1989)  used  multivariate  discriminant  analysis  to  model  the  distribution  of  the 
exposure  variables  for  matched  case-control  data.  McShane  et  al.  (2001)  proposed 
conditional  score  procedure  for  obtaining  bias-corrected  estimates  of  log-odds  ratio 
in  a matched  case-control  study.  In  the  Bayesian  framework  Muller  and  Roeder 
(1997)  handeled  measurement  error  in  unmatched  case-control  studies  by  assuming 
a mixture  of  multivariate  normal  distribution  of  the  exposure  variables.  They 
used  a Dirichlet  process  prior  on  the  mixture  components  of  the  distributions. 

Later  on,  Seaman  and  Richardson  (2001)  extended  the  Miiller-Roeder  approach 
to  the  situation  of  any  number  of  categorical  covariates  and  discretized  continuous 
covariates. 

Also,  as  mentioned  in  previous  Chapters,  there  are  plethora  of  works  on 
missing  exposure  variable  in  case-control  studies.  But  to  the  best  of  my  knowledge, 
there  does  not  exist  any  article  which  simultaneously  addresses  both  missingness 
and  measurement  error  in  an  exposure  variable  in  a unified  framework. 

This  Chapter  develops  a Bayesian  approach  with  a binary  disease  variable 
D,  completely  observed  covariate  Z and  a surrogate  variable  T of  true  exposure 
variable  X , with  partial  missing  observations.  Some  additional  data,  called  validity 
data,  has  measurement  on  T as  well  as  on  the  true  value  of  the  exposure  variable 
X.  I start  with  the  usual  additive  model  of  error,  and  assume  a multivariate 
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normal  distribution  of  exposure  variable  among  the  controls  D — 0 given  Z,  and 
stratum  S.  The  mean  of  X is  modeled  in  terms  of  completely  observed  covariates 
Z,  and  stratum  specific  random  intercept  terms.  Further  I assume  that  the  errors 
are  nondifferential  and  follow  a multivariate  normal  distribution.  Throughout  this 
Chapter  I assume  that  the  surrogate  T is  missing  at  random  in  the  terminology 
of  Little  and  Rubin  (1987).  I work  with  a joint  conditional  likelihood  of  D , 

T,  and  A,  the  vector  of  missing  value  indicator,  given  the  stratum  effects,  the 
completely  observed  covariates  Z , and  the  number  of  cases  in  each  stratum.  I use 
Dirichlet  process  prior  on  the  stratum  heterogeneity  parameters  with  a normal  base 
measure.  As  we  will  see,  there  is  no  explicit  form  of  the  posterior  distribution  of 
the  parameters,  so  I use  Gibbs  sampler  with  Metropolis-Hastings  algorithm  to  draw 
the  random  numbers  from  the  respective  conditional  distributions. 

The  outline  of  this  Chapter  is  follow.  Section  4.2  describe  the  model,  and 
notation  for  measurement  error  while  Section  4.3  deals  with  likelihood,  prior 
and  posterior  distribution  of  the  parameters  for  missing  value  and  measurement 
error  in  exposure  variables.  A real  data  example  is  presented  in  Section  4.4  while 
computational  details  are  considered  in  Section  4.5  followed  by  conclusion. 

Before  concluding  this  section  I highlight  some  novel  features  of  this  Chapter. 
First,  the  present  Chapter  consolidates  missingness  and  measurement  error  under  a 
unique  framework.  Second,  by  using  semiparametric  Bayesian  approach  we  capture 
unmeasured  stratum  heterogeneity  in  the  exposure  distribution.  Finally,  the  data 
example  shows  the  need  of  undertaking  this  approach  to  handle  irregular  data 
situation. 

4.2  Model  and  Notation 

The  study  design  considers  s strata  (matched  sets),  and  each  stratum  consists 
of  1 case  and  M controls.  Let  D be  the  binary  disease  indicator.  The  prospective 
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disease  model  for  the  jth  subject  in  the  ith  stratum  is 


where  (3o(Sl)  is  the  stratum  specific  random  intercept,  (3l  and  (32  are  the  log  odds 
ratio  parameter  corresponding  to  q x 1 vector  of  completely  observed  covariates  Z, 
and  p x 1 vector  of  exposure  variables  X respectively. 

The  usual  method  of  analysis  is  conditional  logistic  regression  (CLR),  and  the 
estimates  of  the  coefficients  /31,  and  /32  are  obtained  by  maximizing  the  conditional 
likelihood 


where  = (Du,  • • • , Am+i)  is  the  vector  of  binary  disease  variables  for  the  z-th 
stratum.  Without  loss  of  generality  one  can  assume  T,  = (1,  0 ■ • • , 0)  Vi,  and  then 
(4.2)  reduces  to 


When  there  are  measurement  errors  in  the  exposure  variables  X,  tests,  and  esti- 
mates of  (32  derived  from  the  above  conditional  likelihood  are  biased.  Armstrong 
et  al.  (1989)  and  McShane  et  al.  (2001)  proposed  some  methods  of  how  to  get 
bias  corrected  estimates  of  the  log-odds  ratio  parameters  in  the  presence  of  mea- 
surement error.  But  their  methods  are  also  inefficient  as  CLR  in  the  presence  of 
missing  observations  on  the  exposure  variables.  Naive  application  of  these  method 
discard  an  entire  stratum  if  the  case  in  that  stratum  is  not  completely  observed. 
Also,  these  methods  ignore  completely  observed  subjects,  if  no  matching  case  or 


M+ 1 


(4.2) 


M+ 1 


(4.3) 
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controls  are  completely  observed,  in  addition  to  incompletely  observed  subjects. 
Moreover,  the  above  mentioned  methods  of  handling  measurement  error  produce 
biased  estimate  of  the  parameters  unless  the  data  are  missing  completely  at  ran- 
dom (MCAR).  In  the  following  Section,  I propose  a Bayesian  method  to  deal  with 
both  missing  values  and  measurement  error  in  exposure  varaibles. 

4.3  Likelihood,  Priors,  and  Posteriors 
Suppose  that  the  set  of  exposure  variables  X is  measured  with  error,  while 
the  set  of  other  covariates  Z is  measured  without  error.  Let  T be  the  error  prone 
observation  for  X,  that  is  Ttj  = XtJ  + Uijy  j = 1,  • • • , M + 1;  i = 1,  ■ • ■ , s;  and 
conditional  on  X,  D is  independent  of  T 

I assume  that  in  the  control  population  the  conditional  distribution  of  the 
exposure  variables  follow  a multivariate  normal  distribution,  i.e. , 

[Xij\Dij  = 0,  Zij,  Si]  ~ Np(6i:j.  Dx),  (4.4) 

where  6 ^ — A0i  + Af  Z^.  Here  Ao^  captures  the  unexplained  stratum  heterogeneity 
in  the  exposure  distribution  which  is  not  measured  through  Z . Further  I assume 
the  nondifferential  error  model  (as  described  in  Carroll  and  Stefanski,  1994) 

U ij  ~ NP(C,  Su)  Vt,j.  (4.5) 

The  first  step  is  to  write  the  full  likelihood  handling  the  measurement  error  in 
exposure  variables.  Using  (4.4)  in  Lemma  1 of  Chapter  2 one  obtains 

j exp{p0(Si)  + 0[ Z^  + (3 IXij} 

x— ^exp{(Xy  - eij)D~1(Xij  - ei3)}dXij 

\Dx\2 

exp [Po{Si)  + Pi  Zi3  + PlGij  + ^ PlDxp2 ] (4.6) 


pr(Aj  = 1[ Si,  Z^) 

pr (Dij  - 0| Si,  Z^) 
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Using  Lemma  2 of  Chapter  2 one  can  easily  obtain  the  distribution  of  Xij  in  the 
case  population  as, 


[Xij\Dij  — 1,  Zij,  5i]  ~ N(9ij  + Dx(32 , Dx).  (4.7) 

Now  using  additive  error  structure,  (4.4),  (4.5),  and  (4.7)  one  obtains  the  distribu- 
tion of  T in  control  and  case  population. 

[Tij\Dij  = 0,  Zij,  ~ N(C  + 9ij,  Dx  + Du)  (4-8) 

[T-ij | Dij  = 1,  Z^,  S^]  ~ N(C  + 6ij  + Dx/ 32,  Dx  + Du).  (4-9) 

For  each  subject,  missing  value  in  the  vector  of  exposure  variables  X.Lj  could  occur 
in  K — 2P  many  ways.  For  example,  if  X^  has  two  components  (say  Xuj  and 
X2 ij)  possible  pattern  of  missingness  are  ( i ) both  Xxij  and  X2ij  are  missing  (ii) 

Xuj  is  missing  and  X2ij  is  observed,  (to)  X2ij  is  missing  and  Xxij  is  observed, 
and  (iv)  none  of  Xxij  and  X2ij  is  missing  . Let  6k,  k — 1,  • • • , K represent  the 
indicator  variable  corresponding  to  each  pattern  of  missingness.  Therefore,  for 
each  subject  one  observes  a vector  of  indicator  variables,  A,j  = (<$L,  Sf  ■ ■ ■ S-j ) 
which  takes  value  1 in  exactly  one  position  and  zero  in  all  other  positions.  I assume 
that  the  missingness  patterns  are  lexicographically  ordered,  i.e. , Sjj  — 1 when 
the  missingness  pattern  is  (0,  0, ... , 0)  (denoting  all  exposures  are  missing),  and 
= 1 if  missingness  pattern  is  (1, 1,  • • • , 1),  i.e.,  the  exposure  vector  is  completely 
observed. 

Further  assume  that  pk(Xij)  and  pk(Xij)  denote  the  joint  distribution  of  the 
observed  components  of  Xtj  corresponding  to  control  and  case  population,  for 
fc  = 1,2,--*  ,K.  When  all  the  components  of  Xvj  are  missing  I set  pl(Xtj)  — 1 and 
p\(Xij)  — 1.  Similarly  one  can  define  pk(Tij)  and  p\(Tij)  for  the  surrogate  variable 
T. 
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Using  (4.6)-(4.9)  one  can  compute  the  full  conditional  likelihood  conditioned 


where  fi*a  = A0i  + Af  Zn  + Dxf32  and  = A0i  + Af  Zxy 

The  above  likelihood  involves  /31,  /32,  Ai,  Dx,  C,  and  Du  and  a set  of  nuisance 
parameters  A0i,  i = 1,  2,  • • • , s which  grows  with  direct  proportion  to  the  sample 
size.  Hence  maximum  likelihood  estimates  of  the  parameters  be  potentially 
inconsistent.  I plug  in  the  estimates  of  C and  Du  obtained  from  a validity  data 
into  the  likelihood  (4.10).  I use  diffuse  normal  prior  for  (3l)  (32,  A\,  inverse  gamma 
prior  on  the  variance  components  of  X and  uniform  (0, 1)  prior  on  the  correlation 
coefficient  between  any  two  exposure  variables.  Finally,  I assume  that  A0i  ~ G, 
and  G ~ DP(aG o),  Dirichlet  process  prior  with  concentration  parameter  a , and 
base  measure  Go.  Because  in  our  data  analysis  p equals  2,  I took  G0  as  a bivariate 
normal  distribution.  All  the  parameters  were  estimated  by  Markov  chain  Monte 
Carlo  technique  and  computation  details  are  given  in  Section  4.5. 

If  there  is  no  missing  observation  i.e. , Ay  = (0,  • • • , 0, 1)  Vi,  j,  one  can  work 
with  the  following  likelihood.  The  latter  does  not  involve  any  nuisance  parameters. 
This  is  in  contrast  to  proposed  in  Armstrong  et  al.  (1989). 


on  Dij  — 1.  Without  loss  of  generality  one  can  assume  that  DiX  — 1,  and 


Dij  — 0,  j = 2,  • • • , M + 1.  Then  the  likelihood  is 


(4.10) 
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IV  exp[/3^ (Zir  - Za)  + (3T2  (»ir  - tfj] 


1 


M+ 1 


(Dx  + Du ) 1(Ty  — Tn  — Af(Zij  — Zn ) + Dx(32)] 


(4.11) 


4.4  Example:  Colon  Cancer 


I apply  the  proposed  method  to  a case-control  data  from  a colon  cancer  study 
in  Canada  (see  Miller  et  al.,  1983)  where  171  male  cases  of  colon  cancer  were 
individually  matched  by  age  and  neighborhood  of  residence  to  171  controls.  Height 
and  weight  are  measured  for  each  study  individual  apart  from  their  measurement 
on  calories,  protein,  total  fat,  dietary  fat,  and  R carotin.  The  goal  of  the  study  is 
to  see  the  effect  of  the  two  continuous  exposure  variables  such  as  total  calories(Xj) 
and  fiber  intake(X2)  on  the  risk  of  colon  cancer.  An  interesting  feature  of  this 
dataset  is  that  reliability  and  validity  studies  reveal  that  the  measurement  on 
the  exposure  variables  are  subject  to  a correlated  measurement  error  structure. 
Armstrong  et  al.  dealt  with  the  issue  of  measurement  error  in  this  dataset.  I 
propose  a Bayesian  alternative  for  correcting  for  the  measurement  error  and  then 
model  the  association  of  the  adjusted  exposures  in  terms  of  completely  observed 
covariates  such  as  height  and  weight  of  the  subject.  I also  study  the  effect  of 
missingness  in  this  situation  by  introducing  artificial  missingness  in  the  exposures. 
This  example  accounts  for  interesting  unorthodox  data  scenarios  that  one  may 
have  at  hand  and  addresses  the  problem  of  missingness,  measurement  error  and 
association  modeling  for  a set  of  continuous  exposures  under  the  same  framework. 


81 


There  was  an  initial  study  which  investigated  the  validity  of  the  data  by 
comparing  dietary  histories  reported  by  16  healthy  volunteers  with  detailed  weighed 
food  records  kept  by  the  volunteers’  spouse.  The  spousal  record  were  considered  as 
a gold  standard.  The  sample  mean  and  dispersion  of  the  difference  between  these 
two  records  served  as  a direct  estimate  of  the  parameters  of  the  distribution  of 
the  measurement  error.  Therefore  the  estimate  of  the  mean  of  t/,j  namely,  C and 
dispersion  matrix  Du , are  obtained  from  this  validity  data.  I consider  height  and 
weight  as  two  covariates. 

Sample  means  of  recorded  and  reported  intakes  for  total  calories  and  fiber  from 
the  validity  data  were  T — (32.0,31.9)  and  X — (25,20.4);  their  mean  difference 
serves  as  an  estimate  of  C,  namely,  C — (6.9, 11.5).  The  moment  estimate  of  Du 
from  the  validity  data  is, 

( 56.6  35.1  ^ 


Du  — 


^ 35.1  70.1 

and  consequently,  Du  is  replaced  by  Du  in  the  likelihood  in  (4.10).  I put  prior 
on  the  components  of  Dx  in  the  fully  Bayesian  analysis.  I used  IG(4.2,100),  and 
IG(4,100)  priors  for  cr^  , and  a^2  respectively,  and  uniform  prior  on  the  correlation 
coefficient  (pXlx2)  between  total  calories(Ad),  and  fiber  intake  (Al2). 

For  the  proposed  analysis  I assumes  a bivariate  normal  distribution  for  the 
exposure  variables  in  the  control  population.  I used  Normal(0, 10)  prior  for  all 
the  regression  parameters  and  log-odds  ratio  parameters,  and  a bivariate  normal 
distribution  BVN(0,  0, 10, 10, 0.5)  for  the  base  measure  of  the  Dirichlet  process 
(DP)  prior.  With  the  same  ideas  of  previous  Chapters,  here  I consider  again  two 
situations:  (i)  when  stratum  effect  is  constant  on  the  distribution  of  exposure 
variables,  i.e.,  A0fc  = A0,  and  use  a bivariate  normal  prior  on  this  parameter, 
and  (ii)  when  the  stratum  effect  on  the  distribution  of  the  exposure  variable  are 
assumed  to  vary  across  strata.  In  this  situation,  I use  DP  prior  on  G.  Since  G 
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is  almost  surely  discrete,  the  DP  prior  allows  equality  of  some  of  the  stratum 
effects.  The  number  of  distinct  stratum  effect  depends  on  the  value  of  a.  For 
the  ease  of  notation,  the  first,  and  the  second  cases  are  referred  to  as  PBC  and 
BSP  respectively.  Table  4-1  contains  the  posterior  means,  posterior  standard 
deviation  and  95%  HPD  credible  intervals  for  the  parameters  under  the  proposed 
analysis  along  with  the  naive  analysis  of  the  data  without  considering  measurement 
error,  and  the  measurement  error  corrected  method  in  Armstrong  et  al.  (1989) 
abbreviated  as  BA. 

The  naive  uncorrected  analysis  suggests  a significant  positive  association  of  the 
disease  and  total  calories,  and  negative  association  between  the  disease  and  total 
fiber  intake.  These  association  become  strong  when  I consider  measurement  error 
in  the  predictor  variables. 

I generated  10%  missing  observation  completely  at  random  observation  in 
each  of  the  exposure  variable,  and  reran  all  the  analyses.  The  results  are  given 
in  table  4.6.  The  results  are  fairly  clear.  The  first  point  to  note  that  for  both 
situations  PBC,  and  BSP  have  smaller  standard  errors  for  calory  than  that  of 
BA.  For  partially  missing  data  the  estimates  in  BA  get  changes  by  50%  (Calory), 
and  76%  (Dietary  Fiber)  which  is  much  higher  than  that  of  PBC  and  BSP. 

After  correction  for  measurement  error,  PBC  and  BSP  shows  significant  negative 
association  between  colon  cancer,  and  dietary  fiber  whereas  such  effect  is  absent 
in  BA’s  analysis.  Finally,  one  can  see  that  in  all  situations  BSP  method  produces 
efficient  estimate  of  the  parameters  in  terms  of  standard  errors.  In  the  presence  of 
missing  observation,  BSP  outperforms  all  other  methods  and  has  a clear  edge  over 
the  other  methods  as  the  percentage  of  missing  observation  increases  in  the  dataset. 

4.5  Computational  Details 

The  basic  computational  scheme  is  the  same  as  in  the  previous  Chapters.  I 
highlight  primarily  the  differences  that  I encounter  for  the  present  modeling. 
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Note  that  apart  from  the  regression  parameter  Ai,  and  the  log-odds  ratio  pa- 
rameters /31  and  /32,  I have  another  set  of  parameters  through  Dx.  At  every  cycle, 
value  of  each  of  the  parameters  are  updated  by  Metropolis-Hastings  algorithm  and 
finally  with  all  four  updated  values  of  /31,  (32,  Ai,  and  Dx  I move  to  the  iteration 
steps  for  Aoj,  i — 1,  • ■ • , s. 

Generating  observations  from  conditional  of  Ao*:  A key  feature  of  the  Dirichlet 
process  is  its  discreteness,  which  in  my  context  implies  that  the  A0j,  i — 1,  • • • , s 
concentrate  on  a set  of  some  k < s distinct  values.  First  of  all,  the  random  measure 
G with  Dirichlet  process  prior  has  prior  mean  E(G)  — Go  , and  a the  precision 
parameter  determines  the  closeness  of  G around  Go.  The  larger  the  a , the  closer 
will  G tend  to  be  to  Go. 

Note  that  the  prior  of  A0i  conditional  on  A0(-i)  is, 


where  A0(_i)  = (Aoi,  ■ • • , Ao»-i,  Ao»+i,  • ■ • , Aos),  the  vector  of  all  but  ith  stratum 
effect  parameter.  Note  that  as  a — ► oo,  and  a — » 0 cluster  size  become  k — s, 
and  k — 1 respectively.  The  intuitive  idea  behind  this  is  that  whenever  A0i  comes 
from  Go,  it  will  be  distinct  from  the  rest,  and  if  a — 0 the  second  term  of  (4.12) 
vanishes,  and  all  the  A0i  come  from  the  same  cluster.  Note  that  asymptotically,  as 
s — > oo,  the  cluster  size  has  a prior  mean  alog(s). 

When  (4.12)  combined  with  the  likelihood,  this  yields  the  following  conditional 
distribution 


(4.12) 


^Oz|Aq(— i),  Hi  ~ ^ T 


(4.13) 


Here 


Qi,j  ^ ( tji , Aqj  ) 

n = ba  f Lc(yi,0)Go{6) 
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where  yi  is  the  observed  data  in  the  stratum  i,  and  b is  such  that  Qi,j  + r*  = 1- 
Now  the  Gibbs  sampling  is  feasible  if  r*  or  Hi  have  a tractable  form  which  is 
possible  if  Go  is  conjugate  prior.  Since  Go  is  nonconjugate  so  I use  Metropolis- 
Hastings  algorithm  to  update  each  values  of  A0j. 

To  illustrate  the  computational  steps,  let  us  fix  our  attention  to  a particular 
stratum,  say,  stratum  i.  In  my  data  analysis  I will  take  Go  as  a bivariate  normal 
distribution  with  mean  Co  = 0 and  variance-covariance  matrix 


I proceed  in  the  following  way: 

Step  1.  Start  with  some  reasonable  values  of  A0i,  i = 1, s.  The  current  state 
of  the  Markov  chain  consists  of  these  s values,  namely,  (A0i, Ao», ..,  A0s). 

Step  2.  Draw  a candidate  value  Aj^  from  the  distribution  of  A0i|A0(_;),  namely, 


There  are  several  ways  of  generating  observations  from  the  above  mixture 
distribution,  which  essentially  reflects  that  for  the  i-th  stratum  effect  you  either  get 
a new  distinct  value  from  the  normal  component  of  the  mixture  with  probability 
a/ (a  + s — 1),  otherwise  it  is  drawn  with  equal  probability  from  the  current  set  of 


parameters. 

Step  3.  Compute  the  acceptance  probability  a(AJi,  A0j)  = min{l,  Lc(h*0i\-)l 
Ac(Aoi|-)},  where  Lc(AqJ-),  and  Lc(A0i|-)  are  the  conditional  likelihoods  as  given  in 
(4.10)  evaluated  at  A^,  and  A0j,  respectively,  with  all  other  parameters  held  fixed 


from 


S 


(a  + s — 1)  1 yt  c>a0,  (-)  + {a/(a  + s — l)}Ar2(Co,  ^o), 


(s  — 1)  entries  of  A0(_j),  the  vector  corresponding  to  the  other  ( i — 1)  stratum  effect 


at  their  current  values. 

Step  4.  Set  the  new  value  of  Aoi  to  A^  with  the  above  probability. 
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Step  5.  Repeat  Steps  2-4  R times.  I chose  R = 5 in  all  my  computations. 
Consider  the  last  of  these  R updates  of  A0i  as  the  current  value  of  the  i-th  stratum 
effect  parameter.  I repeat  the  above  steps  for  all  s stratum  effect  parameters. 

The  full  conditional  for  a is  the  same  as  described  in  the  previous  Chapter.  To 
be  specific,  for  a G(a,  b ) prior  on  a,  one  has 

7r(a|r?,  k)  = pG(a  + K,b  — log (77))  + (1  - p)G(a  + k - 1,  b - log(r?)) 

where  p — (a  + k — l)/[a  + k — 1 + s{b  — log(r;)}].  Here  r)  ~ B(a  + 1,  s)  and  k,  be 
the  number  of  distinct  Aoj. 

4.6  Concluding  Remarks 

To  summarize  the  finding  of  this  Chapter,  one  should  note  that  this  is  the  first 
attempt  to  address  measurement  error  and  missingness  in  exposure  variables  in 
matched  case-control  studies  in  a unified  framework.  Also,  here  I consider  multiple 
continuous  exposures  with  some  association  structure  and  account  for  unmeasured 
stratum  heterogeneity  on  exposure  distributions,  which  is  not  measured  through 


observed  covariates. 
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Table  4-1:  Analysis  of  colon  cancer  data  using  full  dataset 


Parameter 

BSP 

Mean  Sd  HPD  region 

PBC 

Mean  Sd  HPD  region 

Height 

Weight 

Calory 

Dietary  Fiber 

0.33  0.85  (-1.29  ,2.08) 

0.38  0.89  (-1.39  ,2.14) 

0.06  0.026  (0.007,0.125) 

-0.068  0.026  (-0.12,-0.02) 

0.46  0.81  (-1.11,2.04) 

0.49  0.79  (-1.14,2.0) 

0.051  0.019  (0.01,0.08) 

-0.032  0.014  (-0.05,-0.01) 

Parameter 

CLR 

Mean  Sd 

BA 

Mean  Sd 

3C  stand  for 

Height 

Weight 

Calory 

Dietary  Fiber 

0.39  0.89 

0.57  0.86 

0.042  0.014 

-0.022  0.012 

* * 

* * 

0.10  0.034 

-0.030  0.019 

BSP  stands  f 

or  Bayesian  semiparametric  method  whereas  P 

parametric  Bayes  methods  assuming  constant  stratum  effects;  and  CLR,  and  BA 
stand  for  conditional  logistic  regression  and  method  proposed  in  Armstrong  et  al. 

(1989)  respectively. 
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Table  4-2:  Analysis  of  colon  cancer  data  with  10%  artificial  missing  observation  in 
each  of  the  exposure  variables 


Parameter 

Mean 

BSP 

Sd  HPD  region 

Mean 

PBC 

Sd  HPD  region 

Height 

0.33 

0.84 

(-1.35,1.85) 

0.49 

0.82 

(-1.15,1.99) 

Weight 

0.36 

0.78 

(-1.14,1.84) 

0.58 

0.79 

(-1.08,2.08) 

Calory 

0.047 

0.032 

(0.001,0.13) 

0.069 

0.034 

(0.02,0.15) 

Dietary  Fiber 

-0.057 

0.028 

(-0.12,-0.01) 

-0.030 

0.014 

(-0.06,-0.02) 

BSP  stands 


Parameter 

CLR 

Mean  Sd 

BA 

Mean  Sd 

Height 

2.14 

1.34 

* 

* 

Weight 

0.028 

1.99 

* 

* 

Calory 

0.068 

0.022 

0.15 

0.056 

Dietary  Fiber 

-0.055 

0.018 

-0.053 

0.022 

:'or  Bayesian  semiparametric  method  whereas  PBC  stand  for 


parametric  Bayes  methods  assuming  constant  stratum  effects;  and  CLR,  and  BA 
stand  for  conditional  logistic  regression  and  method  proposed  in  Armstrong  et  al. 

(1989)  respectively. 


CHAPTER  5 

MODELLING  ASSOCIATION  AMONG  MULTIVARIATE  EXPOSURES  IN 

CASE-CONTROL  STUDIES 

5.1  Introduction 

There  are  many  instances  where  the  exposure  variables  themselves  may  be 
associated  among  themselves.  One  such  association  model  is  considered  in  Chapter 
4 dealing  with  multivariate  continuous  exposure  variables.  We  consider  two  other 
association  schemes  in  this  Chapter:  (i)  two  exposure  variables  with  a bivariate 
binary  distribution  and  (ii)  two  correlated  exposure  variables,  one  binary  and 
the  other  continuous.  I propose  a full  likelihood  based  approach  by  a parametric 
modeling  of  the  multivariate  exposure  distribution  incorporating  their  association. 
For  matched  case-control  studies,  unexplained  stratum  heterogeneity  may  be 
present.  Our  model  incorporates  this  stratum  heterogeneity  on  the  parameters 
involved  in  the  exposure  distribution.  Since  the  model  and  the  likelihood  turn  out 
to  be  of  a complex  form  with  a growing  number  of  parameters,  the  parameters 
are  estimated  in  a fully  Bayesian  framework  by  using  the  Markov  chain  Monte 
Carlo  technique.  Also,  our  model  accomodates  partial  missingness  in  the  exposure 
variables. 

Outside  the  case-control  context  there  has  been  extensive  amount  of  work  on 
modeling  association  among  multivariate  exposures.  Zhao  and  Prentice  (1990)  pro- 
posed a pseudolikelihood  method  for  the  estimation  of  parameters  of  a quadratic 
exponential  family  in  terms  of  the  marginal  means  and  pairwise  correlation.  Lang 
et  al.  (1999)  proposed  association  modeling  in  multivariate  categorical  response 
data.  They  formulated  generalized  log-linear  models  that  simultaneously  model 
the  association  structure  as  well  as  the  marginal  distributions  of  the  responses 
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and  outlined  how  to  obtain  the  maximum  likelihood  estimates  of  the  parameters. 
Ekholm  et  al.  (2000)  proposed  association  models  for  multivariate  binary  data 
through  latent  variables  and  Markov  type  dependence  structure.  But  so  far,  to  the 
best  of  my  knowledge,  none  of  the  models  have  been  extended  to  the  case-control 
domain. 

We  reemphasize  that  one  important  feature  of  this  Chapter  is  that  it  touches 
on  is  simultaneous  modeling  of  the  joint  distribution  of  categorical  and  continuous 
exposures  in  a case-control  context.  There  has  been  some  relatively  recent  Bayesian 
work  involving  modeling  the  marginal  joint  distribution  of  the  exposures.  Muller 
and  Roeder  (1997)  considered  a novel  semiparametric  model  for  the  joint  distri- 
bution of  continuous  exposures  in  presence  of  measurement  error.  They  assumed 
a prospective  model  for  the  disease  status  and  a joint  marginal  distribution  for 
the  exposures.  The  evaluation  of  the  retrospective  likelihood  involves  computing 
the  marginal  disease  probabilities  which,  in  turn,  require  an  integration  with  re- 
spect to  the  exposure  distribution.  Computation  of  this  integral  becomes  hard  in 
a high-dimensional  exposure  space.  Seaman  and  Richardson  (2001)  adapted  the 
Miiller-Roeder  approach  for  several  categorical  exposure  variables.  To  avoid  the 
high-dimensional  integration,  they  advocated  converting  continuous  exposures  into 
categorical  ones  which  certainly  entails  some  loss  of  information.  As  one  should 
note,  in  the  formulation  of  the  likelihood  as  presented  in  this  Chapter,  I avoid 
evaluation  of  the  marginal  prevalence  of  disease  and  consequently,  do  not  need  to 
perform  a high  dimensional  numerical  quadrature. 

Muller  et  al.  (1999)  adopt  a fully  retrospective  model  for  analyzing  case- 
control  data  using  the  joint  distribution  of  a mixed  set  of  quantitative  and  cate- 
gorical variables.  They  assumed  a mixture  of  normal  model  for  the  quantitative 
covariates  and  conditional  on  the  quantitative  covariates,  a multivariate  probit 
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model  for  the  categorical  covariates.  Our  formulation  of  the  likelihood  and  asso- 
ciation modeling  is  different  from  their  proposal.  First,  I do  assume  a prospective 
model  and  only  need  to  assume  the  distribution  of  the  exposures  in  control  popula- 
tion in  order  to  accommodate  partial  missingness  in  the  exposure  variables.  Muller 
et  al.  took  a fully  retrospective  route  and  thus  had  to  assume  the  distribution  of 
exposure  in  both  case  and  control  populations.  I model  the  exposure  distribution 
related  parameters  via  a completely  observed  covariate  which  is  not  the  case  for 
Muller  et  al.  (1999).  Our  work  focuses  on  matched  case-control  studies  and  ac- 
counts for  stratum  heterogeneity  on  the  exposure  distribution  whereas  Muller  et  al. 
considered  unmatched  studies  where  there  is  no  need  to  model  such  unexplained 
stratum  specific  effects. 

Two  datasets  are  analyzed  in  the  Chapter,  each  interesting  in  its  own  right. 
First,  I consider  the  association  structure  for  a bivariate  binary  exposure  in  the 
context  of  the  well-known  Los  Angeles  Endometrial  Cancer  Study  (Breslow  and 
Dey,  1980)  which  contains  natural  missingness  in  one  of  the  exposures.  Second, 

I consider  the  scenario  when  there  is  one  binary,  and  one  continuous  exposure. 

A case-control  dataset  on  fibrocystic  breast  disease  is  (Pastides  et  al.,  1985) 
considered  to  illustrate  the  methodology.  One  of  the  significant  exposures  is 
partially  missing  in  this  dataset. 

In  the  both  examples,  I find  that  by  exploiting  the  association  among  ex- 
posures, especially  in  the  presence  of  missingness,  one  obtains  better  estimates 
with  relatively  smaller  standard  errors  where  there  is  a real  association  among  the 
exposures.  A small  scale  simulation  study  supports  this  finding. 

The  rest  of  the  Chapter  is  organized  as  follows.  In  Section  4.2  I describe  my 
model,  notations  and  formulation  of  the  likelihood  in  presence  of  missingness  in 
a matched  case-control  set-up.  Section  4.3  is  devoted  to  two  different  association 
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models  for  multiple  exposure  variables  as  described  above.  Each  model  is  accom- 
panied by  the  corresponding  data  example  illustrating  the  proposed  methodology. 
Section  4.4  contains  a small  scale  simulation  study  to  indicate  the  effect  of  associ- 
ation modeling  when  association  among  exposures  truly  exists  and  Section  6 is  the 
concluding  section  with  a discussion  of  the  findings  and  final  remarks. 

Before  concluding  this  section  I highlight  some  strikingly  new  features  of  this 
article.  First,  I consider  modeling  possible  association  among  exposure  variables 
having  potential  missing  observations  in  a matched  case-control  set-up.  Second, 

I consider  two  different  association  schemes  including  simultaneous  modeling 
of  continuous  and  categorical  risk  factors  which  may  be  associated.  Third,  the 
presence  of  unexplained  stratum  heterogeneity  on  the  exposure  distribution 
is  accommodated  in  my  model  as  I am  dealing  with  highly  stratified  matched 
studies.  Finally,  I address  unorthodox  data  scenarios  which  include  missingness, 
measurement  error  and  modeling  the  distribution  of  a mixed  set  of  discrete  and 
continuous  exposures  in  a matched  case-control  set-up  which  is  often  encountered 
in  real  studies.  The  examples  reflect  such  real  scenarios  and  the  need  for  taking 
these  nonstandard  data  issues  into  account  while  formulating  my  model  and 
methodology. 

5.2  Model  and  Notation 

For  subject  j in  matched  set  i,  j = 1,  •••  , M + 1;  i — 1,  ■ ■ ■ , s,  let  Dij  be  a 
binary  indicator  for  the  disease  status,  namely,  D ^ = 1 for  case  and  D ^ = 0 for 
control.  For  each  subject  in  the  dataset  one  observes  a p x 1 vector  of  exposure 
variables  Xtj  and  a q x 1 vector  of  covariates  Zzj  which  is  completely  observed. 

I allow  for  the  possibility  that  different  components  of  Jy  may  be  missing  for 
different  subjects.  Following  the  notations  of  Chapter  4,  for  each  subject  one 
observes  a vector  of  indicator  variables  A y = (£h,  ■■  - 6$). 
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I assume  that  p(Xij\Si,  Zij,  Dij  = 0)  be  a distribution  of  the  exposure  variable 
in  the  control  population  with  respect  to  a c-finite  dominating  measure  /r.  Note 
that  by  modeling  the  distribution  of  the  X in  terms  of  Z one  captures  stratum 
heterogeneity  measured  through  Z but  there  may  still  be  some  unexplained 
heterogeneity  which  would  be  captured  through  a random  intercept  term.  I assume 
that  7:  be  a vector  of  parameters  in  the  distribution  of  X corresponding  to  Z,  and 
70  be  a vector  of  stratum  specific  parameters. 

Assume  that  the  prospective  model  for  the  disease  given  the  exposure  vari- 
ables, covariates  and  stratum  effect  is 

pr(Ai  = l|5i,  Xa,  Z^)  = H(/3o{Si)  + 0[Z{j  + (5.1) 

where  H(x)  = 1/(1  + e~x);  Po(Si)  is  the  stratum  specific  intercept  term;  and  f31, 
and  (32  are  the  two  log-odds  ratio  parameters  corresponding  to  the  completely 
observed  covariates,  and  the  exposure  variables. 

In  case  of  fully  observed  data,  the  conditional  likelihood,  conditioning  on 

= 1)  eliminates  the  nuisance  parameters  (3o(Si)  and  involves  only  the 
log-odds  ratio  parameters.  In  the  presence  of  missingness,  the  classical  method  of 
conditional  logistic  regression  is  inefficient.  In  Chapter  2,  I proposed  a Bayesian 
semiparametric  method  for  accounting  for  missing  data  via  modeling  of  the 
exposure  distribution.  The  potential  drawback  of  this  method  is  that  it  does  not 
take  into  account  the  underlying  association  structure  among  a set  of  multivariate 
exposure  variables  which  may  prove  to  be  useful  when  some  exposure  variables  are 
missing.  The  intuitive  reason  for  modeling  the  association  is  to  borrow  information 
regarding  the  missing  exposure  from  the  corresponding  observed  exposure  variables 
through  the  association  structure. 
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The  basic  structure  of  the  joint  likelihood  in  the  presence  of  missingness  is 

P r(-Aj , Xij , , Si)  — p(X , Aj j,  Zij , Si)  xp(Aij\Dij , Zij , Si)  x pr (-Aj \ Zij,  Si). 

(5-2) 

Note  that  instead  of  using  a retrospective  likelihood  I work  with  a joint  likelihood 
of  disease  variable,  exposure  variable,  and  missing  value  indicator  given  the 
completely  observed  covariates,  and  the  stratum  effect.  The  consistency  and 
asymptotic  property  of  the  maximum  likelihood  estimator  obtained  from  the  above 
likelihood  follow  from  Rathouz  (2003). 

Following  the  notations  of  Chapter  4,  and  assuming  without  loss  of  generality 
that  the  first  subject  in  each  stratum  is  a case  and  rest  are  controls,  by  Lemmas  1 
and  2 of  Chapter  2,  one  begins  with  the  likelihood 


£c(/3i,/32>7i,7o)  = 

S 
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J]  Dir  = 1) 

r—  1 

s ( M+l 

} J \ pr(Ai  - 1,  Da  = 0,  • • ■ , Am+i  = 0,  | Su  {Z^}* 1+1,  ]T  Dir  = 1) 


i=l 


r— 1 


M+l 


OC 


n 


p(Xn\S,,Zn,Dn  = 1,  A,,)  J[  = 0,  A«) 

j= 2 
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(5.3) 


1=1  t fc=l 

The  above  expression  is  a generalization  of  (4.10).  The  parameters  involved  in 
the  above  likelihood  are  estimated  in  a fully  Bayesian  framework.  The  details 
will  be  provided  for  each  individual  case.  As  one  will  find,  there  will  be  no  closed 
form  available  for  the  posterior  and  any  exact  Bayesian  inference  is  analytically 


94 


intractable.  Hence  I estimate  the  model  parameters  through  Markov  chain  Monte 
Carlo  numerical  integration,  and  in  particular,  via  Gibbs  sampling  techniques  of 
generating  sample  from  conditional  distributions  of  the  different  parameters  given 
the  remaining  parameters  and  the  data.  The  general  computational  scheme  is 
similar  as  given  in  previous  Chapters  with  the  added  simplification  that  we  have 
not  used  any  Dirichlet  process  prior  on  the  stratum  effect  parameters  and  instead 
have  used  only  independent  normal  priors.  Use  of  Dirichlet  process  prior  in  this 
context  is  a part  of  my  future  research. 

In  the  following  subsections  I present  the  specific  forms  of  the  likelihoods  and 
other  details  for  the  two  association  scenarios  I consider. 

5.2.1  Bivariate  Binary  Exposure  Variable 

I first  consider  the  situation  with  bivariate  binary  exposure  variables  Xij  = 
with  corresponding  log  odds  ratio  parameter  /32  = Assume 

the  retrospective  distribution  of  the  exposure  variable  in  control  population  to  be, 


pixW.xg^Z^Dij 


= °)  = 7^  exP(0r 


c\ 


3(1)  yW  I fl(2)  ya 
Ui  Aij  + uij  A*j 


(2) 


+ xijxffxg)) 


(5.4) 


where  d^\  , and  are  modeled  as  9^  = yon  + 7 nZij,  9-f  = 702 * + 712 Zij, 

and  = 7o3z  + 7i3^q-  is  the  normalizing  constant. 

Remark  1.:  Note  that  A ^ is  the  odds  ratio  between  the  two  random  variables,  i.e. , 


A p(xV  = 1,  x\f  = l|5a,  Zij,  Djj  = 0 )v(x\f  = 0,  x[f  = 0|5a,  Zjj^Djj  = 0) 

ij  P{X[]]  0,  x\f  = 115,,  Zijt  Dij  = 0 )p(x\f  = 1,  X[f  0|5,,  ZVJ , Dtj  = 0) 

(5.5) 

and  9^  could  be  interpreted  as  the  logit  of  probability  of  success  of  X-^  condi- 
tioned on  X[f  = 0,  i.e.,  p(x\f\S^  Zijt  I)u  = 0,xW  = 0)  = exp^flgtyil  + 
exp(0-J1))}.  Similarly  9^  could  be  interpreted  as  the  logit  of  the  probability  of 
success  of  xff  given  that  X-j'1  = 0. 
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Using  (5.4)  in  Lemma  1 of  Chapter  2 one  obtains, 


where  C\f  = 1 + exp(6»|j))  + exp(6^2))  + exp(6>|j)  + 0 \f  + Ay)  and  C\f 
1 + exp(0-^  + fyi)  + exp(0-^  + P22)  + exP(^  + @if  + \j  + P21  + /?22)  • 


Using  Lemma  2 of  Chapter  2 and  (5.6)  one  obtains  the  exposure  distribution 
in  the  case  population  as: 


Marginally  each  component  of  Xy  is  a binary  random  variable  and  one  can  easily 
obtain  the  marginal  distributions  in  case  and  control  population.  In  this  situation 
A ij  — Using  (5.4),  (5.6)  and  (5.7)  we  can  write  the  likelihood  as 


To  estimate  the  parameters  I use  independent  normal  prior  on  each  of  the  parame- 
ters and  estimate  them  through  Markov  chain  Monte  Carlo  integration  scheme. 
Example:  Endometrial  Cancer  Data 

Let  us  revisit  the  endometrial  cancer  data  as  considered  in  Chapter  2.  Most 
of  the  major  risk  factors  for  endometrial  cancer  can  be  characterized  as  states  of 
excess  unopposed  estrogenic  stimulation  of  the  endometrium.  Several  studies  have 
indicated  that  exogeneous  estrogen  treatment  increases  the  risk  of  endometrial 
cancer  (see  Schottenfeld  and  Fraumeni,  1996).  Among  other  factors,  obesity  may 
increase  the  risk  of  endometrial  cancer  by  altering  estrogen  metabolism.  Hyper- 
tension has  also  been  reported  to  be  common  among  women  with  endometrial 
cancer.  But  several  studies  controlling  for  certain  confounding  factors  such  as, 


(5.7) 
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parity,  recent  weight  gain  or  estrogen  use,  showed  no  evidence  of  increased  risk  of 
endometrial  cancer  in  presence  of  hypertension,  concluding  that  the  apparent  effect 
of  hypertension  is  in  fact  manifested  through  increased  levels  of  obesity  and  estro- 
gen use.  The  presence  of  gall  bladder  disease  may  increase  the  risk  of  endometrial 
cancer  and  that  could  potentially  be  due  to  an  association  of  gall  bladder  disease 
with  obesity  or  estrogen  replacement  therapy.  Obesity  is  a potential  exposure  with 
16%  values  missing  in  the  dataset.  Though  there  are  several  other  potential  risk 
factors  such  as,  smoking,  total  dietary  fat  and  alcohol  consumption  recorded  in  this 
dataset,  it  has  been  scientifically  established  that  the  use  of  estrogen,  depending 
on  the  dose  level,  plays  a major  role  in  endometrial  cancer  disease.  With  this  un- 
derlying medical  theory  in  mind,  I reanalyzed  the  LA  cancer  dataset  with  obesity 
and  gall  bladder  disease  as  two  associated  binary  exposures.  Estrogen  plays  the 
role  of  a completely  observed  covariate  Z through  which  parameters  related  to  the 
bivariate  binary  exposure  distribution  are  modeled. 

Bivariate  binary  distribution  is  assumed  for  the  two  exposure  variables  in 
the  control  population.  The  related  parameters  9^  and  A^’s  are  modeled 
in  terms  of  Z (estrogen  use)  as:  6^  — ^mi  + 7n%,  dlf  = 702 » + 712%,  and 
A ij  = 703 i + 713%.  I used  Normal(0,10)  priors  for  each  of  the  parameters.  For 
each  of  the  examples  I perform  three  analyses.  The  first  is  my  proposed  method 
of  Bayesian  analysis  accounting  for  the  association  among  the  exposure  variables 
(denoted  by  AM).  The  second  is  a Bayesian  method  following  Sinha  et  al.  (2003) 
where  exposure  variables  are  assumed  to  be  independent  (denoted  by  IM)  and  the 
third  one  is  the  usual  conditional  logistic  regression  (CLR).  For  the  independence 
modeling  I set  the  log-odds  ratio  coefficient,  A y equal  to  zero  in  (5.4). 

I carried  out  the  three  analyses  on  the  observed  data.  I reran  all  the  three 
methods  when  30%  observations  on  the  exposure  variable  (the  presence  of  gall 
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bladder  disease)  were  randomly  deleted.  I then  compared  the  performances  of  the 
three  methods  in  presence  and  absence  of  missingness. 

To  summarize  the  results,  for  the  observed  data,  one  can  notice  that  use  of 
estrogen  and  gall  bladder  disease  are  both  significant  risk  factors  for  endometrial 
cancer.  There  is  very  little  numerical  difference  between  the  estimates  and  their 
standard  errors  obtained  by  using  the  AM  model  and  the  IM  model.  In  contrast, 
for  CLR  one  observes  a significantly  higher  value  of  the  standard  error.  With 
artificial  missingness  in  gall  bladder  disease  one  can  notice  a significant  difference 
in  the  estimates,  and  in  the  standard  errors  between  AM  and  IM.  AM  produces 
estimates  with  smaller  standard  error  than  IM  and  the  AM  estimates  are  relatively 
closer  to  their  original  observed  data  counterparts.  The  explanation  lies  in  the  fact 
that  the  association  between  gall-bladder  disease,  and  obesity  is  exploited  in  the 
AM  model,  yielding  more  efficient  estimates.  As  expected  CLR  produces  worse 
estimates  in  presence  of  missing  exposure  variable. 

5.2.2  One  Binary  Exposure  and  Another  Belonging  to  an  Exponential  Family 

In  this  section,  I consider  the  modeling  of  a mixed  set  of  continuous  and 
categorical  exposures.  Though  I write  my  model  with  the  categorical  exposure  to 
be  binary,  the  methods  may  easily  be  extended  to  any  categorical  exposure. 

Suppose  and  are  two  exposure  variables  for  the  jth  subject  in  the  ith 
stratum.  I assume  that  xf^  is  binary  and  the  conditional  distribution  of  xj^  given 
X--1  belongs  to  a general  exponential  family  model  for  all  i and  j.  This  implies 
that  the  marginal  distribution  of  X ^ belongs  to  a two  component  mixture  of 
general  exponential  family.  Thus, 


p{X\f  = 115,;,  Zij,  Dij  = 0)  = 7 Tij 


(5.9) 
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where  logit  (7Ty)  =7oi  + 71 -Zj3  and  the  conditional  distributions  of  X^'1  given 
are  assumed  to  be 


c\f\ Si,  Zijt  Dij  = 0,  xV  = 0)  = exp{  ^ ^ ^ 


+ c(X™,<t>  „)},  (5.10) 


^1  Sit  Zij,  Di:j  = 0,  xg>  = 1)  = exp{!lbfL_^M  m ,r  yP) 


+ c(Xg\<f>1)},  (5.11) 


where  9q ij  = 702 i + ^2  Zij  and  &uj  = 7o3i  + 7 JZij.  Note  the  marginal  distribution  of 
X is  a two  component  mixture  of  exponential  distribution  in  both  the  case  and 
control  population. 

Remark  2.  Note  that  if  Xff  is  a binary  exposure  variable  with  the  following 
parameterization , 


v(X\?\ Si,  Zij,  Dij  = 0,  x\f  = 0)  = exp {90ijx\f  - log(l  + exp(^))} 

p{x[f\Si,  Z^,  D^  = 0,  X\f  = 1)  = exp {eUjxW  - log(l  + exp(0li3))} 

and  p(X-^\Si,  Ztj,Dij  — 0)  = exp{^jX^  — log(l  + e^)},  then  the  joint  distribution 
of  xj^  and  xj^  is  the  bivariate  binary  distribution  (5.4)  of  Section  2.1  with 

+ log{(l  + exp(6>0ij))/(l  + exp(0ii3-))},  9(?)  = 90ij,  and  XtJ  = 9Uj  - 90ij. 
Using  prospective  logistic  model  for  the  disease  status,  and  (5.9)- (5.11)  in 
Lemma  1 of  Chapter  2 one  obtains, 


pr(Aj 

pr(Aj 


1| Sj,  Zjj) 

0|5i,  Z^) 


= J j eMPo(Si)+pTZij+pTXij)p(xV,Xg)\Si,Zij,Dij=0) 

dp{x\f)dx\f 

exp(/?o(5i)  + 01 Z^ ) | (1  - 7 Xij)  J exp {faX^{x\f\Su  Zl3, 

Xg>  = 0,  D^  = 0)dX™ 

+Hij  J exp(p21+p22X^)p(xg)\Sl,  Zl3,  X<f  = l,  D^  = 0 )dx\f  \ 
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= exp(/?0 (Si)  + PiZij){  (1  - 7 Tij)  exp{ 


b(Ooij  + 0o/?22)  — b(6oij) 
0 0 
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i b(9uj  + 0i/322)  - b(9uj) 

+nij  exp{ P21} 


<t>  1 


exp(/?o(5i)  “I-  /3 ^ 


(5.12) 


where 


z,  \ {b(Ooij  + 0o/?22)  ~ b(doij)  f b(duj  + (P1P22)  ~ b(9uj) 

Qij  = (1  — 7Tij)  exp{ — } + 7Tyexp{ 7 + P21}, 
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(5.13) 


and  /i(-)  is  a counting  measure.  Using  Lemma  2,  the  marginal  distribution  of 
x\ ^ in  the  case  population  is  again  seen  to  be  a mixture  of  exponential  family  of 
distributions.  The  joint  conditional  likelihood  is 


Lr  cx 
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E,=i  exp(/3i  Zij)9ij  k=1  J=2  k=1  J 


Note  that  the  extension  of  this  method  to  the  situation  when  conditional  distribu- 
tion of  xff  given  belongs  to  a multi-parameter  generalized  exponential  family, 
is  straightforward  and  do  not  need  any  extra  calculation. 

Example:  Fibrocystic  Breast  Disease 

This  data  is  part  of  a large  case-control  study  done  in  1979  at  two  hospitals 
in  New  Haven,  Connecticut  (see  Pastides  et  al.,  1985)  to  assess  the  risk  factors 
for  benign  fibrocystic  breast  diseases.  The  original  data  consists  of  255  women 
(cases)  aged  20-74  years  with  biopsy  confirmed  fibrocystic  breast  lesions  and  790 
controls  at  two  hospitals  in  New  Haven,  Connecticut  during  1979  (Pastides  et 
al.).  The  purpose  of  the  study  was  to  ascertain  whether  known  risk  factors  for 
malignant  breast  cancer  are  also  established  risk  factors  for  this  type  of  benign 
breast  diseases.  There  are  many  variables  recorded  in  this  dataset  including 
demographic  characteristics  of  the  patient,  medical  history  information,  history  of 
breast  cancer  in  the  family.  The  fraction  of  the  data  used  in  my  analysis  consists 
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of  50  strata  each  of  which  contains  1 case  and  3 controls.  Controls  were  matched 
with  a case  on  the  basis  of  their  age  at  the  time  of  interview.  The  dataset  was 
discussed  in  Hosmer  and  Lemeshow  (2000,  Section  7.8).  There  are  some  variables 
which  are  naturally  missing  in  the  dataset.  Cases  were  found  to  have  a higher 
level  of  education,  a recent  history  of  medical  check-ups  and  a higher  age  at  first 
pregnancy.  The  latter  two  exposures  have  natural  missingness  in  the  dataset.  The 
established  risk  factor  for  this  disease  is  age  at  menarche  (first  menstrual  period), 
late  age  at  menopause,  late  age  at  first  full  term  pregnancy,  postmenopausal 
obesity,  low  physical  activity  and  higher  dose  exposure  to  ionizing  radiation 
early  in  life.  The  data  contain  several  other  covariates  such  as  the  educational 
degree,  marital  status,  regular  medical  checkups,  number  of  stillbirth,  number 
of  miscarriages,  number  of  live  birth.  For  the  purpose  of  my  analysis  I consider 
education  as  the  completely  observed  covariate  Z . History  of  medical  checkups  and 
age  at  first  pregnancy  are  considered  as  two  exposure  variables,  say  and  X ^ . 

Z and  X^  are  defined  specifically  as  follows: 

f the  ijth  subject  has  a junior  college  degree  or  above 


I model  log{7Tij/(l  — n y)}  = yon  + 71  Ztj,  and  the  canonical  parameters  involved  in 
the  distributions  of  X ^ are  modeled  as,  60ij  = 702 i + 72 Z^  and  6^  = 703;  + 73 Zy. 
X was  re-scaled  by  a factor  of  10.  I further  assumed  that 


1 if  the  ijth  subject  had  regular  medical  checkups 


otherwise 


otherwise 


[X[f\ SuZ^Dij  = 0,x£}  = 0]  ~ Normal(0Oij, <Tq), 


and 


[Xg]\ SuZ^Dij  = l,x[f  = 1]  ~ Normally, of). 


It  follows  that  the  marginal  distribution  of  X ^ in  both  the  case  and  the  control 
population  is  a mixture  of  two  normal  distributions.  For  each  of  the  parameters  (3\ 
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(since  Z is  a single  covariate),  /32,  71,  72,  73  and  7ou,  702  i,  703i,  fori  = 1, 2 • ■ ■ , s 
I use  independent  and  fairly  diffuse  normal  priors.  The  posterior  means  of  the 
parameters  are  taken  as  our  estimates.  The  results  are  presented  in  Table  5- 
3. For  the  independence  model  we  assume  that  the  marginal  distribution  of  X ^ 
is  a single  normal  distribution  (in  general  a single  distribution  belonging  in  the 
exponential  family)  instead  of  a mixture  and  carry  out  the  three  analysis.  Here  the 
important  exposure  age  at  first  pregnancy  had  11%  missing  values  in  the  original 
dataset.  One  can  notice  some  interesting  numerical  differences  between  AM  and  IM 
estimates  with  AM  estimates  reflecting  a more  pronounced  effect  of  the  established 
risk  factors.  CLR  is  less  efficient  than  any  of  the  Bayesian  methods. 

5.3  Simulation  Study 

To  study  the  effectiveness  of  association  modeling  in  the  context  of  matched 
case-control  study  I performed  a small  simulation  study.  The  performance  of  the 
association  modeling  was  compared  with  independence  modeling  and  conditional 
logistic  regression  method  in  two  different  situations,  when  two  exposure  variables 
are  associated,  and  when  they  are  independent. 

To  simulate  realistic  data  set  I used  endometrial  cancer  data  as  a prototype.  I 
generated  hypothetical  1:1  matched  dataset  with  100  strata  with  a covariate  Z and 
two  exposure  variables  and  X^2\  The  true  value  of  log-odds  ratio  parameters, 
such  as  /3i,  P21,  and  /?22,  and  other  parameters  involved  in  the  joint  distribution  of 
the  exposure  variables  such  as  7n,  y12  and  713  were  taken  as  2.18,  1.11,  0.67,  -1.40, 
0.19,  and  0.44  respectively.  These  are  the  estimates  of  the  parameters  obtained  by 
analyzing  the  dataset  through  association  modeling  approach.  The  stratum  specific 
intercept  terms  7on,  7020  and  703 * were  generated  from  Normal(— 1.40, 2.37), 
Normal(0.42, 2.85),  and  Normal(— 0.67, 2.95)  respectively  as  were  obtained  from  the 
data  analysis.  In  the  original  dataset,  gall  bladder  disease  was  reported  only  for 
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19%  of  the  subjects  so  the  covariate  Z was  generated  as  Bernoulli  random  variable 
with  success  probability  0.19. 

Second,  I generated  the  binary  disease  indicator  D.  For  the  ith  stratum,  it  may 
be  easily  noted  that, 


pr(Ai  — MDn  + Di2  — 1,  Zn,  Zi2,  Si) 


1 + exp{/3i(Z,2  — Zn)}- 


-.(i)  Mo) 

'i2  °zl 
r»(i) 


(5.15) 


where  for  j — 1,2 

= 1 + exp(6§})  + exppjf)  + exp(6»g)  + 9$  + Xlj), 

C\f  = 1 + exp  (0^  + /?2i)  + exP(#!f  + P22)  + exp(0-^  + 0^  + A^-  + P21  + P22)', 
and  e\f  = 7oii  + 711  Zijy  = 702 » + 712 Ztj,  and  A {j  = 703*  + 713  Zij.  I generated 
a binary  random  variable  with  the  above  specified  probability  structure  and 
conditional  on  the  fact  that  the  simulated  value  for  Du  is  a 1 (i.e. , the  subject  is 
a case)  I generated  a binary  exposure  variable  ) from  a Bernoulli  distribution 
with  success  probability  p\  where 


Pi  = 77iy  exp(6»|11)  + P21 ) { 1 + exp(6>|2)  + \{1  + /?22)}. 

°»1 

Now  the  second  binary  exposure  variable  (A^)is  generated  from  a Bernoulli 
distribution  whose  success  probability  p2|i  depends  on  the  value  of  X , and  is 
given  by 


(i) 

il 


_ , _ exp{AllX<11)+e)?)+/322} 

^2|1  l+exp{AiiXf11)+(9l^)+/322} ' 

If  the  simulated  value  for  Du  is  0 (i.e.,  the  subject  is  a control)  I generate  X 
from  a Bernoulli  distribution  with  success  probability  pi  = exp(0-^){l  + exp(0l-2’1  + 
Xn)} /Cfy]  and  subsequently  X^  is  generated  from  a Bernoulli  distribution, 
conditioned  on  X^\  with  the  success  probability  p2\i  = exp{AiiX^  + 0|2)}/(  1 + 
exp{AiiAlj11'1  + 9 -2^}).  Once  I have  simulated  Du  in  the  above  manner,  for  the  other 
observation  in  the  ith  stratum,  Da  = 1 — Du.  The  corresponding  exposure  variables 
are  then  generated  in  the  above  fashion. 
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I performed  two  sets  of  simulations:  one  when  exposure  variables  are  associ- 
ated, and  the  other  when  they  are  independent  (i.e. , the  log-odds  ratio  parameter 
between  the  two  exposure  variable  Ay  = 0.00  Vi,  j).  For  the  Bayesian  methods, 
(i.e.,  AM  and  IM),  I used  Normal(0, 10)  prior  for  each  of  the  parameters.  I repli- 
cated the  simulation  50  times,  generating  50  different  data  sets  and  obtained  the 
parameter  estimates  by  the  above  three  methods.  For  each  replication  I created 
missing  data  by  randomly  deleting  20%  observations  from  each  of  the  exposure 
variables  and  recalculated  all  the  estimates  by  the  three  methods  (i.e,  AM,  IM,  and 
CLR). 

The  result  of  the  simulation  study  is  fairly  clear.  When  the  exposure  variables 
are  truly  associated  and  partially  missing,  AM  performs  much  better  than  IM 
and  CLR.  When  the  exposure  variables  are  not  associated,  AM  and  IM  perform 
comparably  and  both  outperform  CLR.  There  is  a clear  edge  of  AM  over  the  other 
two  methods.  The  relative  advantage  of  using  the  AM  increases  as  the  percentage 
of  missing  observation  increases  in  a dataset.  AM  allows  the  possibility  to  borrow 
information  on  missing  values  of  an  exposure  variable  from  the  other  observed 
components  of  the  exposure  variable  via  the  association  structure  and  also  through 
the  completely  observed  covariate.  But,  due  to  independence  assumption,  IM 
is  not  able  to  extract  information  on  the  missing  components  of  an  exposure 
variable  from  the  observed  components  of  the  exposure  variable,  it  can  only  use  the 
completely  observed  covariate  information.  Thus,  in  the  presence  of  association  and 
missingness  IM  is  less  efficient. 

5.4  Concluding  Remarks 

To  summarize  the  finding  of  this  Chapter  one  can  note  that  this  is  the  first 
attempt  to  account  for  multivariate  association  among  exposures  in  missing  data 
situations  in  a matched  case-control  study.  Through  my  examples  I find  that 
the  association  model  outperforms  the  independence  model  when  one  has  two 
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strongly  associated  exposures  and  an  informative  completely  observed  covariate. 

I also  present  a brief  discussion  of  treatment  of  missingness  and  measurement 
error  under  the  same  framework.  As  mentioned  in  the  introduction  this  model 
presents  a way  to  deal  with  categorical  and  continuous  exposures  simultaneously. 
The  work  presented  here  for  binary  exposures  can  be  extended  to  a general 
categorical  exposure  in  a straightforward  way.  The  model  also  accounts  for  stratum 
heterogeneity  on  exposure  distributions  through  a stratum-specific  intercept  term 
while  modeling  the  parameters  in  the  exposure  distribution.  Thus  this  Chapter 
offers  an  ensemble  of  statistical  methods  for  unorthodox  data  situations  in  a 
matched  case-control  study. 
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Table  5-1:  Results  of  the  endometrial  cancer  data  for  the  association  modelling 


Endometrial  Cancer  Data 

Method 

Estrogen- Use 

Gall-Bladder 

Obesity 

AM 

Mean 

2.18 

1.11 

0.67 

s.e. 

0.42 

0.40 

0.39 

Lower  HPD 

1.48 

0.35 

-0.24 

Upper  HPD 

3.15 

1.94 

1.40 

IM 

Mean 

2.23 

1.11 

0.65 

s.e. 

0.45 

0.41 

0.41 

Lower  HPD 

1.42 

0.35 

-0.20 

Upper  HPD 

3.20 

1.95 

1.43 

CLR 

Mean 

1.95 

1.26 

0.42 

s.e. 

0.50 

0.44 

0.40 

Lower  CL 

0.98 

0.41 

-0.36 

Upper  CL 

2.93 

2.11 

1.19 

With  30%  Artificial  Missingness 

on  Gall-Bladder 

Method 

Estrogen-Use 

Gall-Bladder 

Obesity 

AM 

Mean 

2.37 

1.31 

0.75 

s.e. 

0.45 

0.50 

0.40 

Lower  HPD 

1.56 

0.25 

-0.28 

Upper  HPD 

3.32 

2.42 

1.50 

IM 

Mean 

2.52 

1.70 

0.64 

s.e. 

0.49 

0.56 

0.42 

Lower  HPD 

0.66 

0.68 

-0.22 

Upper  HPD 

3.45 

2.86 

1.40 

CLR 

Mean 

1.69 

0.87 

0.07 

s.e. 

0.64 

0.58 

0.55 

Lower  CL 

0.45 

-0.27 

-1.00 

Upper  CL 

2.94 

2.01 

1.15 

Here  “Mean”  is  the  posterior  mean,  “s.e.”  is  the  posterior  standard  deviation, 
“Lower  HPD”  and  “Upper  HPD”  are  the  lower  and  upper  end  of  the  HPD  region 
respectively,  “Lower  CL”  and  “Upper  CL”  are  the  lower  and  upper  end  of  the 
confidence  limit  respectively.  There  is  natural  missingness  in  Obesity 
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Table  5-2:  Secondary  model  parameter  estimates  for  endometrial  cancer  data 


Parameter 

Association  Modeling 
Full  Data  Missing  Data 

Independence  Modeling 
Full  Data  Missing  Data 

7n 

Mean 

-1.40 

-1.81 

-1.17 

-2.07 

S.  E. 

0.45 

0.53 

0.35 

0.49 

712 

Mean 

0.19 

0.26 

0.22 

0.25 

S.E. 

0.30 

0.31 

0.30 

0.32 

7l3 

Mean 

0.44 

0.33 

S.E. 

0.53 

0.63 

Secondary  model  parameter  estimates  for  association  and  independence  modeling  of 
the  Endometrial  Cancer  Data  example.  Here  “Mean”  is  the  posterior  mean,  “S.E.” 
is  the  posterior  standard  deviation.  Missing  data  implies  the  data  with  30% 
artificial  missingness  on  the  gall-bladder  disease. 


Table  5-3:  Results  of  the  benign  breast  disease  data 


Benign  Breast  Disease 

Method 

Education 

Regular  Medical 

Age  at  First 

AM 

Checkups 

Pregnancy 

Mean 

-0.62 

1.47 

1.86 

s.e. 

0.43 

0.43 

0.50 

Lower  HPD 

-1.40 

0.65 

0.78 

Upper  HPD 

0.38 

2.42 

2.85 

IM 

Mean 

-0.38 

1.61 

1.50 

s.e. 

0.42 

0.43 

0.52 

Lower  HPD 

-1.20 

0.78 

0.44 

Upper  HPD 

0.40 

2.53 

2.47 

CLR 

Mean 

-0.41 

1.37 

1.31 

s.e. 

0.50 

0.49 

0.56 

Lower  CL 

-0.57 

0.41 

0.21 

Upper  CL 

1.39 

2.33 

2.41 

Here  “Mean”  is  the  posterior  mean,  “s.e.”  is  the  posterior  standard  deviation, 
“Lower  HPD”  and  “Upper  HPD”  are  the  lower  and  upper  end  of  the  HPD  region 
respectively,  “Lower  CL”  and  “Upper  CL”  are  the  lower  and  upper  end  of  the 

confidence  limit  respectively. 
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Table  5-4:  Results  of  the  simulation  study  for  the  associated  exposure 


Associated  Exposures:  Full  data 

Method 

Pi 

/?21 

P22 

AM 

Mean 

2.1486 

1.0235 

0.7138 

MSE 

0.1151 

0.1398 

0.0281 

IM 

Mean 

2.2420 

0.9370 

0.7046 

MSE 

0.2479 

0.2543 

0.1012 

CLR 

Mean 

2.3129 

1.1077 

0.7383 

MSE 

0.2171 

0.2251 

0.1682 

Associated  Exposures: 

Missing  Data 

Method 

Pi 

P21 

P22 

AM 

Mean 

2.3689 

1.0939 

0.6309 

MSE 

0.1628 

0.1395 

0.1194 

IM 

Mean 

2.4614 

1.0000 

0.8824 

MSE 

0.1828 

0.4813 

0.1295 

CLR 

Mean 

2.4409 

1.2744 

0.8069 

MSE 

0.4824 

0.8018 

0.5507 

Results  of  the  simulation  study  when  the  data  was  generated  with  association 
among  exposures  with  Ay  = 7031  + 0.44Zy , 703i  ~ Normal(— 1.40,  2.38).  Here 
;‘Mean”  is  the  mean  of  the  50  estimates  corresponding  to  the  50  simulated  datasets 
while  MSE  is  the  mean  squared  error.  The  true  parameter  values  are  f3\  — 2.18, 
P21  — 1.11  and  P22  = 0.67.  AM  and  IM  stand  for  association  and  independence 
modeling  respectively.  Missing  data  was  created  by  randomly  deleting  20%  of 
observations  on  and  X^2\ 
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Table  5-5:  Results  of  the  simulation  study  for  the  independent  exposures 


Independent  Exposures:  Full  data 

Method 

Pi 

P21 

P22 

AM 

Mean 

2.2769 

0.7913 

0.7700 

MSE 

0.2289 

0.1200 

0.0986 

IM 

Mean 

2.4165 

0.9710 

0.7609 

MSE 

0.1315 

0.1647 

0.0496 

CLR 

Mean 

2.3919 

1.2365 

0.6647 

MSE 

0.3277 

0.2097 

0.1960 

Independent  Exposures: 

Missing  Data 

Method 

Pi 

P21 

P22 

AM 

Mean 

2.3165 

0.8186 

0.8883 

MSE 

0.1426 

0.3189 

0.2380 

IM 

Mean 

2.4955 

0.9136 

0.6286 

MSE 

0.1812 

0.2433 

0.2162 

CLR 

Mean 

2.4559 

1.3303 

0.8878 

MSE 

0.4908 

0.5052 

0.6597 

Results  of  the  simulation  study  when  the  was  data  generated  with  independent 
exposures  with  log-odds  ratio  \j  = 0.00.  Here  “Mean”  is  the  mean  of  the  50 
estimates  corresponding  to  the  50  simulated  datasets  while  MSE  is  the  mean 
squared  error.  The  true  parameter  values  are  f3\  — 2.18,  P21  — 1-11  and  /322  = 0.67. 
AM  and  IM  stand  for  association  and  independence  modeling  respectively.  Missing 
data  was  created  by  randomly  deleting  20%  of  observations  on  X ^ and  X (2K 


CHAPTER  6 

SUMMARY  AND  DIRECTIONS  FOR  FUTURE  RESEARCH 

Case  control  studies  have  drawn  considerable  attention  from  epidemiologist  as 
well  as  from  statisticians.  In  this  dissertation  we  considered  the  problem  of  ana- 
lyzing matched  case-control  studies  with  partial  missing  observations  in  exposure 
variables  assuming  the  data  are  missing  at  random.  To  handle  the  missingness 
I proposed  a full  likelihood  based  approach.  The  novelties  of  my  approach  lies 
in  capturing  the  unmeasured  stratum  heterogeneity  on  the  distribution  of  the 
exposure  variables.  I estimate  the  model  parameters  in  a semiparametric  Bayesian 
framework.  This  is  the  first  unified  Bayesian  approach  in  matched  case-control 
studies  which  can  handle  multiple  disease  states,  missing  exposure  variables  and 
the  measurement  errors  in  the  exposure  variables.  I have  also  considered  modeling 
association  among  multiple  exposures  in  a matched  case-control  setup  when  some 
of  the  exposure  variables  are  partially  missing. 

There  are  many  possible  extensions  of  my  proposed  methodology  for  analyzing 
case-control  data.  First,  one  can  extend  the  proposed  method  of  handling  missing 
data  to  nested  case-control,  case-cohort  (Prentice  1986),  and  population-based 
case-control  study  (Benichou  and  Gail,  1995;  Benichou  and  Wacholder,  1994). 
Second,  in  my  research  I assumed  presence  of  a completely  observed  covariate(s)  in 
the  datasets  and  modeled  the  distribution  of  the  exposure  variables  in  terms  of  the 
former.  One  can  also  investigate  how  to  handle  the  problem  of  missing  exposure 
in  the  absence  of  such  completely  observed  covariate(s).  Chatterjee  et  al.  (2003) 
considered  a similar  problem  for  unmatched  case-control  studies.  They  used  a 
semiparametric  method  following  the  approach  of  Lawless  et  al.  (1999).  But  the 
analogous  approach  in  the  context  of  matched  case-control  studies  is  still  unknown. 
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Another  potential  problem  is  to  handle  nonignorable  missing  values  in  a 
exposure  variable  in  a case-control  studies.  Chapter  3 deals  with  missing  exposure 
variable  in  matched  case-control  studies  with  multiple  disease  states.  Disease  can 
be  classified  on  the  basis  of  various  pathological  information  on  a subject.  But 
all  pathological  information  may  not  be  observed  for  all  the  study  subjects  and 
consequently  one  faces  the  problem  of  missing  information  on  the  disease  states 
which  is  not  yet  addressed  in  the  literature. 

Another  possible  area  of  research  is  to  adapt  my  methods  for  analyzing  sur- 
vival data.  Some  work  in  this  regard  has  been  initiated  in  a frequentist  framework 
by  Prentice  and  Breslow  (1978),  but  Bayesian  inference  for  such  problems  seems  to 
be  largely  unexplored.  For  example  the  methods  could  potentially  be  adapted  to 
bivariate  survival  models  in  family  based  study  design  (see  Oakes  1986,  1989  and 
Shih  and  Chatterjee,  2002).  Finally,  genetic  case-control  study  is  a new  emerging 
area  of  research  where  one  of  the  objectives  is  to  study  the  association  between 
a candidate  gene  and  a disease.  Accounting  for  population  substructure  through 
varying  stratum  effects  is  of  particular  relevance  in  such  studies. 

Bayesian  approach  could  also  be  useful  to  analyze  data  coming  from  other 
designs  used  in  an  epidemiologic  context  (Khoury,  Beaty  and  Cohen,  1993).  For 
example,  another  frequently  used  design  is  a nested  case-control  design  (Wacholder 
1996)  where  a case-control  study  is  nested  within  a cohort  study.  This  type  of 
study  is  useful  to  reduce  the  cost  and  labor  involved  in  collecting  the  data  on  all 
individuals  in  the  cohort  as  well  as  to  reduce  computational  burden  associated 
with  time-dependent  explanatory  variable.  Unlike  the  case-control  design  this 
design  allows  us  to  estimate  the  absolute  risk  of  the  disease  by  knowing  the  crude 
disease  rate  from  the  cohort  study.  Some  other  study  designs  along  this  line  are 
the  proband  design  (Gail,  Pee  and  Carroll,  2001)  and  case-only  design  (Armstrong 
2003). 
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It  is  important  to  note  that  case-control  designs  are  choice-based  sampling 
designs  in  which  the  population  is  stratified  on  the  values  of  the  response  variable 
itself.  Among  others,  this  was  noticed  in  Scott  and  Wild  (1986)  who  compared  in 
such  cases  a maximum  likelihood  based  approach  with  some  other  ad  hoc  methods 
of  estimation.  Breslow  and  Cain  (1988)  considered  in  such  cases  a two-phase 
sampling  design.  In  later  years,  there  is  a long  series  of  publications  on  inference 
for  such  designs,  but  any  Bayesian  approach  to  these  problems  is  still  lacking,  and 
will  be  a worthwhile  future  undertaking. 
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